Dankyu Yoon, Young Jin Kim, Taesung Park.
Abstract
BACKGROUND: The success of genome-wide association studies (GWAS) has drawn more attention to the personal genome and to clinical applications such as diagnosis and disease risk prediction. However, previous prediction studies using known disease-associated loci have not been successful (area under the curve [AUC] of 0.55 to 0.68 for type 2 diabetes and coronary heart disease). Several factors contribute to this poor predictability, including the small number of known disease-associated loci, simple analyses that do not account for phenotypic complexity, and the limited number of features used for prediction.
Year: 2012 PMID: 23281841 PMCID: PMC3521177 DOI: 10.1186/1752-0509-6-S2-S11
Source DB: PubMed Journal: BMC Syst Biol ISSN: 1752-0509
Smoking behaviours phenotypes
| Phenotype | # of cases | # of controls |
|---|---|---|
| CPD10* | 6,02 | 752 |
| Smoking Initiation (SI) | 3357 | 807 |
| Smoking Cessation (SC) | 2064 | 1293 |
* CPD10: binary phenotype of nicotine dependence (ND), defined by <10 cigarettes/day versus >21 cigarettes/day.
Figure 1. Prediction performance by feature selection method, (a) logistic regression (LOG), (b) linear discriminant analysis (LDA), (c) Elastic Net (EN), (d) support vector machine (SVM), and (e) random forest (RF), as the number of SNPs used for prediction of CPD10 varies. The X-axis shows the number of SNPs; the Y-axis shows the AUC score.
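The AUC score reported throughout can be read as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control, with ties counted as one half. The following is a minimal sketch of that definition in plain Python, not the authors' implementation; the function name `auc_score`, the label coding (1 = case, 0 = control), and the toy data are assumptions made for illustration.

```python
def auc_score(labels, scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    labels: iterable of 1 (case) or 0 (control)  -- assumed coding
    scores: predicted scores, higher meaning "more likely a case"
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0   # case ranked above control
            elif c == k:
                wins += 0.5   # tie counts as half
    return wins / (len(cases) * len(controls))


# Toy data (hypothetical): three cases, three controls.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
toy_auc = auc_score(toy_labels, toy_scores)  # 8 of 9 case-control pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of cases from controls. The quadratic pairwise loop is chosen for readability; a rank-based formulation would scale better to cohorts of the size listed in the phenotype table.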
Performance results for CPD10
| CPD10 | # of SNPs | Prediction method | | | | |
|---|---|---|---|---|---|---|
| LR | 100 | 0.7973 | 0.8128 | 0.7715 | 0.8078 | |
| | 400 | 0.8017 | 0.8606 | 0.9137 | 0.8966 | |
| SVM | 100 | | | | | |
| | 400 | 0.8474 | | | | |
| RF | 100 | 0.8143 | 0.7999 | 0.821 | 0.8206 | |
| | 400 | 0.7752 | 0.8709 | 0.8813 | 0.8669 | |
| EN | 100a | 0.8547 | 0.8273 | 0.8567 | 0.8585 | |
| | 250a | 0.8731 | 0.9046 | 0.9022 | | |
| LDA | 100 | 0.7758 | 0.7801 | 0.7205 | 0.7814 | |
| | 400 | 0.7948 | 0.8411 | 0.911 | 0.8939 | |
In each column, the best results are shown as underlined. In each row, the best results are boldfaced. a. # of SNPs with non-zero coefficient.
Performance results for SI
| SI | # of SNPs | Prediction method | | | | |
|---|---|---|---|---|---|---|
| LR | 100 | 0.7597 | 0.7171 | 0.7067 | 0.7605 | |
| | 500 | | | | | |
| SVM | 100 | 0.6421 | 0.6204 | 0.6794 | 0.6813 | |
| | 500 | 0.7953 | 0.6930 | 0.7943 | 0.8075 | |
| RF | 100 | 0.5961 | 0.5980 | 0.5848 | 0.5957 | |
| | 500 | 0.6185 | 0.6138 | 0.6010 | 0.6210 | |
| EN | 100a | | | | | |
| | 163a | 0.8157 | 0.8084 | 0.7454 | 0.8180 | |
| LDA | 100 | 0.6338 | 0.5925 | 0.5807 | 0.6273 | |
| | 500 | 0.7387 | 0.6212 | 0.7176 | 0.7464 | |
In each column, the best results are shown as underlined. In each row, the best results are boldfaced. a. # of SNPs with non-zero coefficient.
Performance results for SC
| SC | # of SNPs | Prediction method | | | | |
|---|---|---|---|---|---|---|
| LR | 100 | 0.7290 | 0.7187 | 0.6919 | 0.7308 | |
| | 500 | 0.8245 | 0.7856 | 0.8587 | 0.8591 | |
| SVM | 100 | 0.7299 | 0.6975 | 0.7413 | 0.7414 | |
| | 500 | | | | | |
| RF | 100 | 0.7007 | 0.7053 | 0.7013 | 0.6992 | |
| | 500 | 0.7728 | 0.7598 | 0.7676 | 0.7714 | |
| EN | 100a | | | | | |
| | 176a | 0.7935 | 0.7964 | 0.7485 | 0.7967 | 0.7955 |
| LDA | 100 | 0.7015 | 0.6946 | 0.6585 | 0.7004 | |
| | 500 | 0.8205 | 0.7426 | 0.8275 | 0.8291 | |
In each column, the best results are shown as underlined. In each row, the best results are boldfaced. a. # of SNPs with non-zero coefficient.
Figure 2. Performance comparison with the same feature selection and prediction method: (a) CPD10, (b) SI, and (c) SC. The X-axis shows the number of SNPs; the Y-axis shows the AUC score.