Yi Wang1, Yi Li1, Xiaoyu Liu1, Weilin Pu1, Xiaofeng Wang2, Jiucun Wang2, Momiao Xiong1,3, Yin Yao Shugart4,5, Li Jin6.
Abstract
Testing dependence/correlation of two variables is one of the fundamental tasks in statistics. In this work, we proposed an efficient method for testing nonlinear dependence between two continuous variables (X and Y). We addressed this research question by using BNNPT (Bagging Nearest-Neighbor Prediction independence Test; software available at https://sourceforge.net/projects/bnnpt/). In the BNNPT framework, we first used the value of X to construct a bagging neighborhood structure. We then obtained the out-of-bag estimator of Y based on the bagging neighborhood structure. The squared error was calculated to measure how well Y is predicted by X. Finally, a permutation test was applied to determine the significance of the observed squared error. To evaluate the strength of BNNPT relative to seven other methods, we performed extensive simulations to explore the relationship between the methods and compared their false positive rates and statistical power using both simulated and real datasets (Rugao longevity cohort mitochondrial DNA haplogroups and kidney cancer RNA-seq datasets). We concluded that BNNPT is an efficient computational approach for testing nonlinear correlation in real-world applications.
Year: 2017 PMID: 28986523 PMCID: PMC5630623 DOI: 10.1038/s41598-017-12783-9
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
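The four steps named in the abstract (bagging neighborhood structure on X, out-of-bag estimate of Y, squared error, permutation test) can be illustrated with a minimal sketch. The abstract does not specify the details, so everything concrete here is an assumption: each bag is a bootstrap sample, an out-of-bag point is predicted by the Y value of its single nearest in-bag neighbor in X, the bags are reused for every permutation, and the p-value is the plain proportion of permutations with an error at least as small as the observed one. The function names `oob_squared_error` and `bnnpt_like_test` are hypothetical and are not taken from the BNNPT software.

```python
import random
from bisect import bisect_left


def oob_squared_error(x, y, n_bags, rng):
    """Sum of squared errors of out-of-bag nearest-neighbor predictions of y."""
    n = len(x)
    sums = [0.0] * n
    counts = [0] * n
    for _ in range(n_bags):
        # Bootstrap sample of indices; duplicates do not matter for 1-NN lookup.
        bag = sorted({rng.randrange(n) for _ in range(n)}, key=lambda i: x[i])
        in_bag = set(bag)
        bag_x = [x[i] for i in bag]
        for i in range(n):
            if i in in_bag:
                continue  # only out-of-bag points are predicted
            pos = bisect_left(bag_x, x[i])
            candidates = [bag[j] for j in (pos - 1, pos) if 0 <= j < len(bag)]
            nearest = min(candidates, key=lambda j: abs(x[j] - x[i]))
            sums[i] += y[nearest]
            counts[i] += 1
    # Average the out-of-bag predictions per point, then sum squared errors.
    return sum((y[i] - sums[i] / counts[i]) ** 2 for i in range(n) if counts[i])


def bnnpt_like_test(x, y, n_perm=200, n_bags=30, seed=0):
    """Permutation p-value: small squared error suggests Y depends on X."""
    # Re-seeding gives identical bags each call, so the neighborhood
    # structure depends on X only (an assumption of this sketch).
    observed = oob_squared_error(x, y, n_bags, random.Random(seed))
    perm_rng = random.Random(seed + 1)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        perm_rng.shuffle(y_perm)
        if oob_squared_error(x, y_perm, n_bags, random.Random(seed)) <= observed:
            hits += 1
    return hits / n_perm


if __name__ == "__main__":
    demo_rng = random.Random(42)
    xs = [demo_rng.uniform(-1, 1) for _ in range(50)]
    ys = [3 * v * v + demo_rng.gauss(0, 0.1) for v in xs]
    print(bnnpt_like_test(xs, ys, n_perm=100, n_bags=20))
```

A small p-value here means that X-neighbors predict Y better than they predict shuffled copies of Y, which is the sense in which the squared error measures dependence.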
Simulation power in nine sample functions.
| N = 50, X ~U(−1,1) | BNNPT | Pearson | Spearman | Kendall | Hoeffding | Distance | CANOVA | MIC |
|---|---|---|---|---|---|---|---|---|
| y = 0 + N(0,1) | 0.050 | 0.058 | 0.053 | 0.055 | 0.068 | 0.059 | 0.053 | 0.046 |
| y = x + N(0,1) | 0.839 | | 0.951 | 0.950 | 0.940 | 0.946 | 0.544 | 0.593 |
| y = | 0.861 | | 0.949 | 0.946 | 0.935 | 0.946 | 0.580 | 0.608 |
| y = sin( | 0.957 | 0.937 | 0.912 | 0.904 | | 0.962 | 0.742 | 0.805 |
| y = sin(3 | | 0.180 | 0.182 | 0.190 | 0.201 | 0.174 | 0.694 | 0.423 |
| y = cos( | | 0.066 | 0.079 | 0.073 | 0.690 | 0.653 | 0.726 | 0.649 |
| y = cos(2 | | 0.060 | 0.065 | 0.066 | 0.151 | 0.109 | 0.720 | 0.570 |
| y = cos(3 | | 0.064 | 0.072 | 0.070 | 0.109 | 0.093 | 0.688 | 0.394 |
Bold indicates the best result among all methods compared.
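The way a false positive rate (first row, y = 0 + N(0,1)) and a power value (remaining rows) are estimated by simulation can be sketched as follows, using the table's setting of N = 50 and X ~ U(−1,1). This is only an illustration under stated assumptions: it uses a Pearson permutation test as the example method, the replicate and permutation counts are arbitrary, and `pearson_perm_pvalue` and `rejection_rate` are hypothetical helper names. It is not the simulation code behind the table.

```python
import random


def pearson_perm_pvalue(x, y, n_perm, rng):
    """Two-sided permutation p-value for Pearson correlation."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    xc = [a - mx for a in x]
    yc = [b - my for b in y]
    # Variances are unchanged by permuting y, so comparing the absolute
    # cross-product is equivalent to comparing |r|.
    observed = abs(sum(a * b for a, b in zip(xc, yc)))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(yc)
        if abs(sum(a * b for a, b in zip(xc, yc))) >= observed:
            hits += 1
    return hits / n_perm


def rejection_rate(signal, n=50, n_rep=100, n_perm=199, alpha=0.05, seed=0):
    """Fraction of simulated datasets in which independence is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_rep):
        x = [rng.uniform(-1, 1) for _ in range(n)]
        y = [signal(v) + rng.gauss(0, 1) for v in x]
        if pearson_perm_pvalue(x, y, n_perm, rng) < alpha:
            rejections += 1
    return rejections / n_rep


if __name__ == "__main__":
    print("false positive rate:", rejection_rate(lambda v: 0.0))
    print("power for y = x + N(0,1):", rejection_rate(lambda v: v))
```

With a null signal the rejection rate estimates the false positive rate and should sit near the significance level; with a real signal it estimates power.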
The p-value comparison of benchmarked methods in Rugao longevity cohort data.
| mtDNA haplogroup | BNNPT | Pearson | Spearman | Kendall | Hoeffding* | Distance | CANOVA |
|---|---|---|---|---|---|---|---|
| D | 0.998 | 0.423 | 0.567 | 0.567 | 1.000 | 0.421 | 0.541 |
| D4 | 0.655 | 0.175 | 0.358 | 0.357 | 1.000 | 0.162 | 0.486 |
| D4a | 0.519 | 0.809 | 0.888 | 0.888 | 1.000 | 0.951 | 0.485 |
| D4b | 0.568 | 0.647 | 0.784 | 0.784 | 1.000 | 0.786 | 0.482 |
| D4b2 | 0.981 | 0.376 | 0.449 | 0.449 | 1.000 | 0.419 | 0.508 |
| D4b2b | 0.799 | 0.580 | 0.548 | 0.548 | 1.000 | 0.728 | 0.426 |
| D5 | 0.188 | 0.568 | 0.694 | 0.694 | 1.000 | 0.782 | 0.502 |
| M12 | 0.907 | 0.739 | 0.605 | 0.605 | 1.000 | 0.888 | 0.527 |
| G | 0.303 | 0.933 | 0.723 | 0.723 | 1.000 | 0.943 | 0.507 |
| G2 | 0.149 | 0.161 | 0.232 | 0.232 | 1.000 | 0.261 | 0.529 |
| M7 | 0.957 | 0.961 | 0.994 | 0.994 | 1.000 | 0.947 | 0.500 |
| M7b | 0.619 | 0.705 | 0.992 | 0.992 | 1.000 | 0.806 | 0.512 |
| M8 | 0.963 | 0.863 | 0.851 | 0.851 | 1.000 | 0.368 | 0.528 |
| M8a | 0.447 | 0.397 | 0.365 | 0.365 | 1.000 | 0.146 | 0.455 |
| C | 0.246 | 0.513 | 0.583 | 0.583 | 1.000 | 0.713 | 0.501 |
| M9 | 0.541 | | 0.054 | 0.054 | 1.000 | | 0.433 |
| M10 | 0.347 | 0.793 | 0.963 | 0.963 | 1.000 | 0.866 | 0.503 |
| N9 | 0.313 | | 0.060 | 0.060 | 1.000 | | 0.435 |
| N9a | 0.352 | 0.084 | 0.193 | 0.193 | 1.000 | 0.130 | 0.471 |
| A | | 0.371 | 0.530 | 0.530 | 1.000 | 0.532 | 0.484 |
| F | 0.224 | 0.113 | 0.065 | 0.065 | 1.000 | 0.170 | 0.434 |
| F1 | 0.442 | 0.239 | 0.127 | 0.127 | 1.000 | 0.280 | 0.466 |
| B | 0.180 | 0.388 | 0.368 | 0.368 | 1.000 | 0.451 | 0.544 |
| B5 | 0.656 | 0.201 | 0.524 | 0.524 | 1.000 | 0.188 | 0.501 |
| B5a | 0.321 | 0.189 | 0.653 | 0.653 | 1.000 | 0.177 | 0.547 |
| B5b | 0.709 | 0.654 | 0.740 | 0.740 | 1.000 | 0.479 | 0.508 |
| B4a | | 0.097 | 0.086 | 0.086 | 1.000 | 0.109 | 0.499 |
| B4b | 0.746 | 0.540 | 0.833 | 0.833 | 1.000 | 0.544 | 0.391 |
Significant p-values (significance level = 0.05) are marked in bold.
*The genotype data X (28 mitochondrial haplogroups) were drawn from a discontinuous distribution, and Hoeffding's independence test may not be reliable for discontinuous distributions.
Comparison of computing time and the number of significant genes detected by each method in the kidney cancer dataset (significance level α = 2.435e-06).
| Kidney cancer dataset | BNNPT | Pearson | Spearman | Kendall | Hoeffding | Distance | CANOVA | MIC |
|---|---|---|---|---|---|---|---|---|
| The number of unique genes (reported in PubMed) | | 15 (1) | 41 (1) | 0 (0) | 0 (0) | 120 (1) | 8 (1) | 3 (0) |
| Significant number | 10617 | 8239 | | 11569 | 4953 | 10946 | 5901 | 8081 |
| Computing time (seconds) | 80* | | 0.0025 | 0.0082 | 1.8 | ~5000 | 20 | 0.027 |
*To compare computing time, the number of permutations for BNNPT was set to 10,000,000. With 100,000 permutations, BNNPT needs only 1 second. Bold indicates the best result among all methods compared. Computing time was recorded for a test between one gene and 604 samples.
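The permutation count matters for more than run time, because it bounds the smallest p-value a permutation test can report. The sketch below assumes the p-value is the plain proportion of permutations at least as extreme as the observed statistic, so the smallest nonzero value is 1 divided by the number of permutations; `smallest_nonzero_pvalue` is a hypothetical helper, and the two permutation counts and α are the ones quoted above.

```python
def smallest_nonzero_pvalue(n_perm):
    """Smallest nonzero p-value when p = (extreme permutations) / n_perm."""
    return 1 / n_perm


alpha = 2.435e-06  # significance level quoted for the kidney cancer dataset

for n_perm in (100_000, 10_000_000):
    p_min = smallest_nonzero_pvalue(n_perm)
    print(n_perm, p_min, "below alpha" if p_min < alpha else "above alpha")
```

Under this assumption, 100,000 permutations give a resolution of 1e-05, which is above α, so only a p-value of exactly zero could fall under the threshold; 10,000,000 permutations give 1e-07, which resolves values below α.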
Reported significant genes detected only by BNNPT and corresponding p-value (the rank of the p-value of each gene from each method) of all methods in kidney cancer dataset (α = 2.435e-06).
| Gene | BNNPT | Pearson | Spearman | Kendall | Hoeffding | Distance | CANOVA | MIC* |
|---|---|---|---|---|---|---|---|---|
| | 0.0E+00 (1) | 2.9E-01 (16358) | 7.7E-02 (16536) | 7.7E-02 (16537) | 5.5E-01 (15039) | 2.0E-02 (16518) | 1.6E-05 (6180) | 2.2E-01 (8577) |
| | 0.0E+00 (1) | 3.8E-03 (11660) | 1.0E-05 (12081) | 1.2E-05 (12081) | 3.4E-02 (11141) | 6.0E-06 (11295) | 3.7E-02 (9785) | 2.2E-01 (8606) |
| | 0.0E+00 (1) | 9.3E-01 (19986) | 2.1E-03 (14118) | 2.2E-03 (14119) | 6.1E-02 (11730) | 6.4E-05 (12684) | 2.4E-02 (8847) | 2.1E-01 (9771) |
| | 0.0E+00 (1) | 9.1E-04 (10773) | 3.5E-05 (12478) | 3.9E-05 (12478) | 3.9E-02 (11281) | 8.0E-06 (11504) | 4.5E-02 (10423) | 2.1E-01 (9539) |
| | 0.0E+00 (1) | 7.0E-03 (12077) | 1.8E-02 (15348) | 1.8E-02 (15348) | 1.5E-01 (12762) | 6.0E-06 (11295) | 6.2E-02 (12087) | 2.2E-01 (8104) |
| | 0.0E+00 (1) | 1.9E-01 (15499) | 1.8E-01 (17402) | 1.8E-01 (17402) | 4.0E-01 (13963) | 1.0E-04 (12905) | 3.9E-02 (9979) | 2.2E-01 (8193) |
| | 0.0E+00 (1) | 4.8E-01 (17676) | 2.0E-01 (17561) | 2.0E-01 (17561) | 2.2E-01 (13232) | 4.6E-05 (12509) | 3.9E-02 (9912) | 2.0E-01 (9969) |
| | 1.0E-07 (9294) | 3.7E-01 (16940) | 8.0E-02 (16576) | 8.0E-02 (16576) | 1.8E-01 (13012) | 1.8E-04 (13225) | 4.2E-02 (10223) | 2.1E-01 (9719) |
| | 1.0E-07 (9294) | 3.4E-02 (13319) | 9.3E-01 (20080) | 9.3E-01 (20080) | 4.8E-01 (14582) | 1.1E-03 (14355) | 4.8E-02 (10731) | 1.9E-01 (11391) |
| | 2.0E-07 (9613) | 3.1E-03 (11494) | 1.8E-03 (14056) | 1.9E-03 (14056) | 1.1E-01 (12409) | 6.0E-06 (11295) | 7.1E-02 (12898) | 2.2E-01 (8447) |
| | 2.0E-07 (9613) | 1.2E-01 (14807) | 8.2E-02 (16593) | 8.2E-02 (16593) | 5.1E-01 (14831) | 7.9E-04 (14140) | 8.8E-02 (13866) | 2.1E-01 (9788) |
| | 2.0E-07 (9613) | 7.2E-05 (9558) | 2.3E-04 (13181) | 2.5E-04 (13181) | 3.7E-01 (13847) | 1.2E-05 (11758) | 3.5E-02 (9659) | 1.9E-01 (12135) |
| | 4.0E-07 (9878) | 7.8E-01 (19403) | 6.4E-05 (12689) | 7.0E-05 (12689) | 5.6E-02 (11628) | 1.2E-03 (14469) | 6.5E-02 (12355) | 2.0E-01 (10651) |
| | 8.0E-07 (10173) | 1.3E-04 (9806) | 7.9E-05 (12754) | 8.6E-05 (12754) | 1.6E-01 (12867) | 5.2E-05 (12582) | 6.0E-02 (11834) | 1.8E-01 (12571) |
| | 8.0E-07 (10173) | 2.7E-02 (13117) | 6.6E-05 (12698) | 7.2E-05 (12698) | 9.7E-02 (12270) | 3.1E-04 (13552) | 1.1E-01 (14453) | 2.0E-01 (10794) |
*Because the p-value of MIC is calculated by table lookup, only the MIC value is listed (if MIC > 0.22378, then the p-value of MIC < 2.435e-06). Genes reported in PubMed are shown in bold italics. The rank of each gene's p-value under each method is also shown; tied ranks were replaced by their minimum.
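The tie rule in the footnote (tied ranks replaced by their minimum) can be sketched in a few lines. `min_ranks` is a hypothetical helper written for illustration, and the input is a short made-up list of p-values in the same style as the BNNPT column, not the full gene list.

```python
def min_ranks(values):
    """Rank values in ascending order; tied values share the smallest rank."""
    first_position = {}
    for position, value in enumerate(sorted(values), start=1):
        # Record only the first (lowest) position at which each value appears.
        first_position.setdefault(value, position)
    return [first_position[value] for value in values]


if __name__ == "__main__":
    p_values = [0.0, 0.0, 1e-07, 1e-07, 2e-07]
    print(min_ranks(p_values))  # [1, 1, 3, 3, 5]
```

This is why several genes with the same p-value carry the same rank in the table, and why the next distinct p-value jumps ahead by the size of the tied group.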
Figure 1. The scatterplot and probability density distribution of 15 gene expressions (reported significant genes detected only by BNNPT) between kidney-cancer and normal groups.
Figure 2. The scatterplot and probability density distribution of UGT1A9 gene expression (reported significant gene detected only by CANOVA) between kidney-cancer and normal groups.
Figure 3. The scatterplot and probability density distribution of HDAC1 gene expression (reported significant gene detected only by Pearson) between kidney-cancer and normal groups.
Figure 4. The scatterplot and probability density distribution of UPK3A gene expression (reported significant gene detected only by Spearman) between kidney-cancer and normal groups.
Figure 5. The scatterplot and probability density distribution of SLC26A9 gene expression (reported significant gene detected only by Distance) between kidney-cancer and normal groups.