Fabrízzio Condé de Oliveira, Carlos Cristiano Hasenclever Borges, Fernanda Nascimento Almeida, Fabyano Fonseca e Silva, Rui da Silva Verneque, Marcos Vinicius G B da Silva, Wagner Arbex.
Abstract
INTRODUCTION: This paper proposes a new methodology to simultaneously select the most relevant SNP markers for the characterization of any measurable phenotype described by a continuous variable, using Support Vector Regression with the Pearson Universal kernel as the fitness function of a binary genetic algorithm. The proposed methodology is multi-attribute, in that it considers several markers simultaneously to explain the phenotype, and is based jointly on statistical tools, machine learning and computational intelligence.
Year: 2014 PMID: 25573332 PMCID: PMC4243330 DOI: 10.1186/1471-2164-15-S7-S4
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Flowchart description of the proposed method.
Statistics of the PTA for milk and of the simulations 1 and 2.
| Database | Minimum | 1st Quartile | Median | Mean | 3rd Quartile | Maximum |
|---|---|---|---|---|---|---|
| Real | -479.5 | 328.0 | 583.2 | 641.3 | 908.3 | 1,978.0 |
| Simulation 1 | -12.6 | 200.9 | 396.8 | 378.0 | 594.8 | 1,296.0 |
| Simulation 2 | -3.163 | -0.2049 | 1.4320 | 57.66 | 149.4 | 301.3 |
Figure 2. Histogram and boxplot of the real and simulated phenotypes. Histograms of the real phenotype (2a), simulated phenotype 1 (2b) and simulated phenotype 2 (2c), and boxplots of the real phenotype (2d), simulated phenotype 1 (2e) and simulated phenotype 2 (2f).
Number of SNPs selected from the p-value of Spearman correlation coefficient in the real database
| Groups | SNP selection* | # SNPs without QC | # SNPs with QC |
|---|---|---|---|
| 1 | < 10⁻⁹ | 68 | 12 |
| 2 | < 10⁻⁸ | 226 | 17 |
| 3 | < 10⁻⁷ | 431 | 43 |
| 4 | < 10⁻⁶ | 712 | 105 |
| 5 | < 10⁻⁵ | 1,181 | 242 |
| 6 | < 10⁻⁴ | 1,996 | 595 |
| 7 | < 10⁻³ | 3,440 | 1,397 |
| 8 | < 10⁻² | 6,512 | 3,356 |
*SNP markers selected with a p-value less than the stipulated threshold.
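The grouping by p-value threshold used in the table above can be sketched as follows. This is a minimal illustration, assuming the Spearman p-values have already been computed for each marker; the marker names and p-values are hypothetical and are not taken from the study.

```python
def group_by_threshold(p_values, thresholds):
    """Return {threshold: sorted marker ids whose p-value is strictly below it}."""
    return {t: sorted(m for m, p in p_values.items() if p < t) for t in thresholds}

# Hypothetical p-values for five markers (illustrative only).
p_values = {"snp_a": 3e-10, "snp_b": 4e-9, "snp_c": 2e-6, "snp_d": 5e-3, "snp_e": 0.4}

# Thresholds of groups 1 to 8 in the table above.
thresholds = [1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2]

groups = group_by_threshold(p_values, thresholds)
for t in thresholds:
    print(f"p < {t:g}: {len(groups[t])} markers")
```

Because each threshold is looser than the previous one, the groups are nested, which is why the SNP counts in the table grow from group 1 to group 8.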
Raw and adjusted p-values of the Spearman correlation of the main SNPs in the simulated database 1.
| SNP | Raw p-value | Adjusted p-value* |
|---|---|---|
| 1 | 1.658460e-13 | 1.658460e-10 |
| 10 | 5.737988e-10 | 5.737988e-07 |
| 20 | 1.898224e-03 | 1.000000e+00 |
| 30 | 1.073910e-02 | 1.000000e+00 |
| 40 | 5.366227e-01 | 1.000000e+00 |
| 50 | 3.520811e-04 | 3.520811e-01 |
| 60 | 7.062554e-06 | 7.062554e-03 |
*Adjusted p-values are calculated by multiplying the raw p-values by the total number of SNPs, which in this case is 1,000 markers; products greater than 1 are reported as 1.
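The adjustment described in the footnote can be reproduced with a few lines. This is a minimal sketch: the function name is an assumption, and the cap at 1 is inferred from the adjusted values of 1.000000e+00 shown in the table. The raw p-values are those listed above for SNPs 1, 10, 20 and 50.

```python
def bonferroni_adjust(raw_p_values, n_tests):
    """Multiply each raw p-value by the number of tests and cap the result at 1."""
    return [min(1.0, p * n_tests) for p in raw_p_values]

# Raw p-values from the table above (simulated database 1, 1,000 markers).
raw = {1: 1.658460e-13, 10: 5.737988e-10, 20: 1.898224e-03, 50: 3.520811e-04}

adjusted = dict(zip(raw, bonferroni_adjust(raw.values(), n_tests=1000)))
for snp, p in adjusted.items():
    print(f"SNP {snp}: {p:.6e}")
```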
Mean and standard deviation of the Pearson correlation for the SVR models constructed from subsets of SNPs selected by raw p-values of the simulated database 1.
| Group | SNP selection with raw p-values* | # SNPs | Linear** | RBF** | PUK** |
|---|---|---|---|---|---|
| 1 | < 10⁻⁹ | 2 (a) | 0.49(0.14) | 0.48(0.14) | 0.54(0.14) |
| 2 | < 10⁻⁸ | 2 (a) | 0.49(0.14) | 0.48(0.14) | 0.54(0.14) |
| 3 | < 10⁻⁷ | 2 (a) | 0.49(0.14) | 0.48(0.14) | 0.54(0.14) |
| 4 | < 10⁻⁶ | 2 (a) | 0.49(0.14) | 0.48(0.14) | 0.54(0.14) |
| 5 | < 10⁻⁵ | 3 (b) | 0.49(0.14) | 0.55(0.14) | 0.56(0.13) |
| 6 | < 10⁻⁴ | 3 (b) | 0.49(0.14) | 0.55(0.14) | 0.56(0.13) |
| 7 | < 10⁻³ | 6 (c) | 0.55(0.14) | 0.58(0.14) | 0.71(0.16) |
| 8 | < 10⁻² | 14 | 0.67(0.13) | 0.67(0.13) | 0.73(0.12) |
| 9 | < 10⁻¹ | 119 | 0.49(0.14) | 0.55(0.14) | 0.56(0.13) |
| 11 | < 0.30 | 307 | 0.50(0.09) | 0.65(0.07) | 0.68(0.06) |
| 12 | < 0.40 | 401 | 0.60(0.11) | 0.69(0.09) | 0.72(0.09) |
| 13 | < 0.50 | 511 | 0.63(0.10) | 0.67(0.09) | 0.67(0.10) |
| 14 | < 0.60 | 598 | 0.58(0.10) | 0.62(0.10) | 0.63(0.11) |
| 15 | < 0.70 | 702 | 0.54(0.12) | 0.58(0.12) | 0.57(0.12) |
| 16 | < 0.80 | 803 | 0.49(0.14) | 0.50(0.14) | 0.49(0.14) |
| 17 | < 0.90 | 914 | 0.39(0.16) | 0.40(0.17) | 0.39(0.17) |
| 18 | - | 7 (d) | 0.64(0.13) | 0.97(0.03) | 0.96(0.04) |
*Raw p-values of the Spearman correlation (without Bonferroni correction).
** Mean and standard deviation of the Pearson correlation in 10-fold cross validation.
(a) Only SNPs 1 and 10. (b) Only SNPs 1, 10 and 60. (c) Only SNPs 1, 10, 50, 60, 121 and 874. (d) Correct model with 7 relevant SNPs and without redundant SNPs (only SNPs 1, 10, 20, 30, 40, 50 and 60).
For each group defined after the first selection, an SVR model was constructed and evaluated for each kernel using the Pearson correlation coefficient in 10-fold cross validation. The bold line indicates the best model from the first selection for PUK.
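For readers unfamiliar with the PUK column, the Pearson VII universal kernel is commonly written as K(x, y) = 1 / [1 + (2·‖x − y‖·sqrt(2^(1/ω) − 1) / σ)²]^ω. The sketch below implements that published form; the default values of σ and ω are assumptions for illustration only, since this record does not state the values used in the study.

```python
import math

def puk_kernel(x, y, sigma=1.0, omega=1.0):
    """Pearson VII universal kernel between two equal-length numeric vectors."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    scaled = 2.0 * dist * math.sqrt(2.0 ** (1.0 / omega) - 1.0) / sigma
    return 1.0 / (1.0 + scaled ** 2) ** omega

# Two hypothetical genotype vectors coded as 0/1/2 allele counts.
x = [0, 1, 2, 1]
y = [0, 1, 1, 1]
print(puk_kernel(x, x))  # identical vectors give the maximum similarity
print(puk_kernel(x, y))  # similarity decreases with distance
```

The two parameters control the width (σ) and the tailing (ω) of the peak, which is what lets the kernel behave like a range of other kernels, from Lorentzian-like to Gaussian-like shapes.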
Mean and standard deviation of the Pearson correlation for the SVR models constructed from subsets of SNPs selected by adjusted p-values of the simulated database 1.
| Group | SNP selection with adjusted p-values* | # SNPs | Linear** | RBF** | PUK** |
|---|---|---|---|---|---|
| 1 | < 10⁻⁹ | 1 (a) | 0.36(0.17) | 0.37(0.17) | 0.45(0.17) |
| 2 | < 10⁻⁸ | 1 (a) | 0.36(0.17) | 0.37(0.17) | 0.45(0.17) |
| 3 | < 10⁻⁷ | 1 (a) | 0.36(0.17) | 0.37(0.17) | 0.45(0.17) |
| 4 | < 10⁻⁶ | 1 (a) | 0.36(0.17) | 0.37(0.17) | 0.45(0.17) |
| 5 | < 10⁻⁵ | 1 (a) | 0.36(0.17) | 0.37(0.17) | 0.45(0.17) |
| 6 | < 10⁻⁴ | 1 (a) | 0.36(0.17) | 0.37(0.17) | 0.45(0.17) |
| 7 | < 10⁻³ | 1 (a) | 0.36(0.17) | 0.37(0.17) | 0.45(0.17) |
| 8 | < 10⁻² | 1 (a) | 0.36(0.17) | 0.37(0.17) | 0.45(0.17) |
| 9 | < 10⁻¹ | 3 (b) | 0.49(0.14) | 0.55(0.14) | 0.56(0.13) |
| 10 | < 0.20 | 3 (b) | 0.49(0.14) | 0.55(0.14) | 0.56(0.13) |
| 11 | < 0.30 | 3 (b) | 0.49(0.14) | 0.55(0.14) | 0.56(0.13) |
| 13 | < 0.50 | 4 (c) | 0.53(0.15) | 0.56(0.14) | 0.78(0.16) |
| 14 | < 0.60 | 4 (c) | 0.53(0.15) | 0.56(0.14) | 0.78(0.16) |
| 15 | < 0.70 | 5 (d) | 0.53(0.15) | 0.56(0.14) | 0.75(0.16) |
| 16 | < 0.80 | 6 (e) | 0.55(0.14) | 0.58(0.14) | 0.71(0.16) |
| 17 | < 0.90 | 6 (e) | 0.55(0.14) | 0.58(0.14) | 0.71(0.16) |
| 18 | - | 7 (f) | 0.65(0.09) | 0.65(0.09) | 0.93(0.02) |
*Adjusted p-values of the Spearman correlation with Bonferroni correction.
** Mean and standard deviation of the Pearson correlation in 10-fold cross validation.
(a) SNP1. (b) SNP1, 10 and 60. (c) SNP1, 10, 50 and 60. (d) SNP1, 10, 50, 60 and 874. (e) SNP1, 10, 50, 60, 121 and 874. (f) Correct model with 7 relevant SNPs and without redundant SNPs.
For each group defined after the first selection, an SVR model was constructed and evaluated for each kernel using the Pearson correlation coefficient in 10-fold cross validation. The bold line indicates the best model from the first selection for PUK.
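The "mean(standard deviation)" entries in these tables summarize one Pearson correlation per cross-validation fold. A minimal sketch of that summary is given below, assuming each fold supplies its observed and predicted phenotypes; the fold data are hypothetical, and the use of the sample standard deviation is an assumption, since the record does not say which estimator was used.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def summarize_folds(folds):
    """Return (mean, sample SD) of the per-fold Pearson correlations."""
    rs = [pearson(observed, predicted) for observed, predicted in folds]
    return statistics.fmean(rs), statistics.stdev(rs)

# Hypothetical (observed, predicted) pairs for three folds.
folds = [
    ([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]),
    ([2.0, 4.0, 6.0, 8.0], [2.5, 3.5, 6.5, 7.0]),
    ([1.0, 3.0, 2.0, 5.0], [1.5, 2.5, 2.5, 4.0]),
]
mean_r, sd_r = summarize_folds(folds)
print(f"{mean_r:.2f}({sd_r:.2f})")  # same layout as the table cells
```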
Number of SNPs and model performance for group 10 of Table 5 before and after the application of the GA.
| Filter | # SNPs | Linear** | RBF** | PUK** |
|---|---|---|---|---|
| Before GA | 220 | 0.36(0.18) | 0.69(0.09) | 0.73(0.09) |
*The selected markers are SNPs 1, 10, 15, 20, 30, 60, 158, 177, 269, 274, 391, 446, 516, 673, 686, 693, 717, 725, 739, 825 and 930.
** Mean and standard deviation of the Pearson correlation in 10-fold cross validation.
The bold line indicates the best model from the second selection for PUK.
The parameters used by the GA were: population size = 100, number of generations = 500, crossover probability = 0.60, mutation probability = 0.033, seed = 1.
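To show where these parameters enter, here is a minimal sketch of a binary genetic algorithm in which each bit marks whether a SNP is kept. The defaults mirror the parameters listed above. Tournament selection, one-point crossover, bit-flip mutation and elitism are assumptions, because the record does not specify the operators. The toy fitness function is hypothetical and stands in for the cross-validated performance of the SVR model with the PUK kernel that the abstract describes as the fitness function.

```python
import random

def binary_ga(fitness, n_bits, pop_size=100, generations=500,
              p_crossover=0.60, p_mutation=0.033, seed=1):
    """Maximize fitness over bit lists; a 1 means the SNP is kept."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        scored = [(fitness(ind), ind) for ind in pop]

        def tournament():
            # Assumed selection scheme: the better of two random individuals.
            a, b = rng.sample(scored, 2)
            return list((a if a[0] >= b[0] else b)[1])

        children = [list(best)]  # elitism (assumption): keep the best so far
        while len(children) < pop_size:
            c1, c2 = tournament(), tournament()
            if rng.random() < p_crossover:
                cut = rng.randint(1, n_bits - 1)  # one-point crossover
                c1, c2 = c1[:cut] + c2[cut:], c2[:cut] + c1[cut:]
            for child in (c1, c2):
                for i in range(n_bits):
                    if rng.random() < p_mutation:
                        child[i] ^= 1  # bit-flip mutation
                children.append(child)
        pop = children[:pop_size]
        best = max(pop, key=fitness)
    return best

# Hypothetical toy fitness: reward three "relevant" SNPs, penalize the rest.
RELEVANT = {0, 3, 5}

def toy_fitness(bits):
    return sum(1.0 if i in RELEVANT else -0.5 for i, b in enumerate(bits) if b)

# Smaller population and fewer generations than in the study, for a quick run.
best = binary_ga(toy_fitness, n_bits=12, pop_size=20, generations=50)
print([i for i, b in enumerate(best) if b])
```

Fixing the seed makes a run repeatable, which is presumably why the seed is reported alongside the other parameters.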
Figure 3. Scatter plot between SNP3 and simulated phenotype 2 (with epistasis).
Mean and standard deviation of the Pearson correlation for the SVR models constructed from subsets of SNPs selected by adjusted p-values of the simulated database 2.
| Group | SNP selection with adjusted p-values* | # SNPs | Linear** | RBF** | PUK** |
|---|---|---|---|---|---|
| 2 | < 10⁻⁸ | 1 (a) | 0.53(0.07) | 0.94(0.04) | 0.95(0.04) |
| 3 | < 10⁻⁷ | 1 (a) | 0.53(0.07) | 0.94(0.04) | 0.95(0.04) |
| 4 | < 10⁻⁶ | 1 (a) | 0.53(0.07) | 0.94(0.04) | 0.95(0.04) |
| 5 | < 10⁻⁵ | 1 (a) | 0.53(0.07) | 0.94(0.04) | 0.95(0.04) |
| 6 | < 10⁻⁴ | 1 (a) | 0.53(0.07) | 0.94(0.04) | 0.95(0.04) |
| 7 | < 10⁻³ | 1 (a) | 0.53(0.07) | 0.94(0.04) | 0.95(0.04) |
| 8 | < 10⁻² | 1 (a) | 0.53(0.07) | 0.94(0.04) | 0.95(0.04) |
| 9 | < 10⁻¹ | 3 (b) | 0.53(0.07) | 0.94(0.04) | 0.95(0.04) |
| 10 | < 0.20 | 4 (c) | 0.53(0.07) | 0.94(0.04) | 0.95(0.04) |
| 11 | < 0.30 | 4 (c) | 0.53(0.07) | 0.94(0.04) | 0.95(0.04) |
| 12 | < 0.40 | 4 (c) | 0.53(0.07) | 0.94(0.04) | 0.95(0.04) |
| 13 | < 0.50 | 4 (c) | 0.53(0.07) | 0.94(0.04) | 0.95(0.04) |
| 14 | < 0.60 | 4 (c) | 0.53(0.07) | 0.94(0.04) | 0.95(0.04) |
| 15 | < 0.70 | 4 (c) | 0.53(0.07) | 0.94(0.04) | 0.95(0.04) |
| 16 | < 0.80 | 4 (c) | 0.53(0.07) | 0.94(0.04) | 0.95(0.04) |
| 17 | < 0.90 | 4 (c) | 0.53(0.07) | 0.94(0.04) | 0.95(0.04) |
*Adjusted p-values of the Spearman correlation with Bonferroni correction.
** Mean and standard deviation of the Pearson correlation in 10-fold cross validation.
(a) Only the SNP3. (b) Only SNPs 3, 4 and 5. (c) Only SNPs 3, 4, 5 and 8964.
For each group defined after the first selection, an SVR model was constructed and evaluated for each kernel using the Pearson correlation coefficient in 10-fold cross validation. The bold line indicates the best model from the first selection for PUK.
Raw and adjusted p-values of the Spearman correlation of the main SNPs in the simulated database 2.
| SNP | Raw p-value | Adjusted p-value* |
|---|---|---|
| 3 | 5.824261e-41 | 5.824261e-37 |
| 4 | 6.087710e-06 | 6.087710e-02 |
| 5 | 3.677127e-06 | 3.677127e-02 |
| 9 | 7.890460e-01 | 1.000000e+00 |
| 12 | 2.088276e-01 | 1.000000e+00 |
*Adjusted p-values are calculated by multiplying the raw p-values by the total number of SNPs, which in this case is 10,000 markers; products greater than 1 are reported as 1.
Mean and standard deviation of the Pearson correlation coefficient for the simulated database 2 groups with raw p-values.
| Group | SNP selection with raw p-values* | # SNPs | Linear** | RBF** | PUK** |
|---|---|---|---|---|---|
| 2 | < 10⁻⁸ | 1 (a) | 0.53(0.07) | 0.94(0.04) | 0.95(0.04) |
| 3 | < 10⁻⁷ | 1 (a) | 0.53(0.07) | 0.94(0.04) | 0.95(0.04) |
| 4 | < 10⁻⁶ | 1 (a) | 0.53(0.07) | 0.94(0.04) | 0.95(0.04) |
| 5 | < 10⁻⁵ | 3 (b) | 0.53(0.07) | 0.94(0.04) | 0.95(0.04) |
| 6 | < 10⁻⁴ | 5 | 0.53(0.07) | 0.94(0.04) | 0.94(0.04) |
| 7 | < 10⁻³ | 15 | 0.53(0.07) | 0.94(0.04) | 0.95(0.04) |
| 8 | < 10⁻² | 99 | 0.64(0.07) | 0.68(0.06) | 0.69(0.06) |
| 9 | < 10⁻¹ | 995 | 0.66(0.06) | 0.79(0.04) | 0.80(0.04) |
| 10 | < 0.20 | 2,053 | 0.73(0.05) | 0.80(0.03) | 0.80(0.03) |
| 11 | < 0.30 | 3,079 | 0.74(0.04) | 0.79(0.03) | 0.79(0.03) |
| 12 | < 0.40 | 4,066 | 0.75(0.04) | 0.78(0.04) | 0.78(0.04) |
| 13 | < 0.50 | 5,033 | 0.74(0.04) | 0.75(0.04) | 0.76(0.04) |
| 14 | < 0.60 | 5,996 | 0.69(0.05) | 0.71(0.05) | 0.71(0.04) |
| 15 | < 0.70 | 6,950 | 0.63(0.05) | 0.64(0.05) | 0.65(0.05) |
| 16 | < 0.80 | 7,934 (c) | 0.51(0.07) | 0.51(0.07) | 0.53(0.07) |
| 17 | < 0.90 | 8,893 (c) | 0.33(0.10) | 0.32(0.10) | 0.33(0.10) |
| 18 | - | 5 (d) | 0.53(0.07) | 0.97(0.02) | 0.99(0.01) |
*Raw p-values of the Spearman correlation (without Bonferroni correction).
** Mean and standard deviation of the Pearson correlation in 10-fold cross validation.
(a) Only SNP3. (b) Contains SNPs 3, 4 and 5. (c) Contains SNPs 3, 4, 5, 9 and 12. (d) Only SNPs 3, 4, 5, 9 and 12.
Mean and standard deviation of the Pearson correlation for the SVR models constructed from subsets of SNPs selected by raw p-values of the real database without QC.
| Group | SNP selection with raw p-values* | # SNPs | Linear** | RBF** | PUK** |
|---|---|---|---|---|---|
| 1 | < 10⁻⁹ | 68 | 0.60(0.14) | 0.68(0.11) | 0.68(0.11) |
| 2 | < 10⁻⁸ | 226 | 0.48(0.17) | 0.72(0.09) | 0.72(0.09) |
| 3 | < 10⁻⁷ | 431 | 0.44(0.16) | 0.74(0.09) | 0.75(0.08) |
| 4 | < 10⁻⁶ | 712 | 0.71(0.09) | 0.77(0.08) | 0.74(0.09) |
| 5 | < 10⁻⁵ | 1,181 | 0.76(0.09) | 0.76(0.08) | 0.78(0.08) |
| 6 | < 10⁻⁴ | 1,996 | 0.78(0.08) | 0.74(0.08) | 0.78(0.08) |
| 7 | < 10⁻³ | 3,440 | 0.80(0.08) | 0.67(0.13) | 0.80(0.08) |
*Raw p-values of the Spearman correlation (without Bonferroni correction).
** Standard deviation of the estimates of the 10-fold cross validation.
Mean and standard deviation of the Pearson correlation for the SVR models constructed from subsets of SNPs selected by raw p-values of the real database with QC.
| Group | SNP selection with raw p-values* | # SNPs | Linear** | RBF** | PUK** |
|---|---|---|---|---|---|
| 1 | < 10⁻⁹ | 12 | 0.67(0.10) | 0.67(0.10) | 0.67(0.10) |
| 2 | < 10⁻⁸ | 17 | 0.64(0.10) | 0.67(0.10) | 0.67(0.10) |
| 3 | < 10⁻⁷ | 43 | 0.59(0.12) | 0.68(0.09) | 0.70(0.08) |
| 4 | < 10⁻⁶ | 105 | 0.31(0.18) | 0.72(0.07) | 0.71(0.07) |
| 5 | < 10⁻⁵ | 242 | 0.67(0.09) | 0.77(0.08) | 0.78(0.07) |
| 6 | < 10⁻⁴ | 595 | 0.77(0.08) | 0.69(0.09) | 0.79(0.07) |
| 7 | < 10⁻³ | 1,397 | 0.78(0.08) | 0.79(0.09) | 0.79(0.08) |
*Raw p-values of the Spearman correlation (without Bonferroni correction).
** Mean and standard deviation of the Pearson correlation in 10-fold cross validation.
The bold line indicates the best model from the first selection for PUK.
Mean and standard deviation of the Pearson correlation coefficient in 10-fold cross validation with 10 repetitions for the best subset found by the GA, with the same parameters used for group 8 of Table 11.
| Real Database | # SNPs | Linear kernel | RBF kernel | PUK kernel |
|---|---|---|---|---|
| Database without QC before GA | 6,512 | 0.81(0.08) | 0.81(0.08) | 0.81(0.08) |
| Database with QC before GA | 3,357 | 0.80(0.08) | 0.75(0.08) | 0.80(0.08) |
** Mean and standard deviation of the Pearson correlation in 10-fold cross validation.
The bold lines indicate the best model from the second selection for PUK.
The parameters used by the GA were: population size = 20, number of generations = 20, crossover probability = 0.60, mutation probability = 0.033, seed = 1.
Figure 4. LD matrix between markers selected by the proposed method. LD matrix computed by r² between markers of group 8 without QC (4a) and with QC (4c), and the subsets extracted from group 8 by the GA without QC (4b) and with QC (4d). The color scale is interpreted as follows: high LD is white and low LD is green.
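A pairwise r² matrix of the kind shown in Figure 4 can be approximated as the squared Pearson correlation between genotype vectors coded as 0/1/2 allele counts. The sketch below uses that genotype-based approximation, which is an assumption: the record does not say how r² was computed, and haplotype-based estimates can differ. The genotype data are hypothetical.

```python
def r_squared(g1, g2):
    """Squared Pearson correlation between two genotype vectors (0/1/2 coding)."""
    n = len(g1)
    m1, m2 = sum(g1) / n, sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    v1 = sum((a - m1) ** 2 for a in g1)
    v2 = sum((b - m2) ** 2 for b in g2)
    if v1 == 0 or v2 == 0:
        return 0.0  # monomorphic marker: correlation is undefined, report 0
    return cov * cov / (v1 * v2)

def ld_matrix(genotypes):
    """Symmetric matrix of pairwise r² values, one row per marker."""
    return [[r_squared(a, b) for b in genotypes] for a in genotypes]

# Hypothetical genotypes: rows are markers, columns are individuals.
genotypes = [
    [0, 1, 2, 1],
    [0, 1, 2, 1],  # identical to marker 0: complete LD
    [2, 1, 0, 1],  # mirror image of marker 0: also r² = 1
    [0, 2, 0, 2],  # uncorrelated with marker 0
]
for row in ld_matrix(genotypes):
    print(" ".join(f"{v:.2f}" for v in row))
```

A subset with mostly low off-diagonal values contains little redundancy between markers, which is the property the figure compares before and after the GA.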