Damian Roqueiro, Menno J Witteveen, Verneri Anttila, Gisela M Terwindt, Arn M J M van den Maagdenberg, Karsten Borgwardt.
Abstract
MOTIVATION: Predicting disease phenotypes from genotypes is a key challenge in medical applications in the postgenomic era. Large training datasets of patients that have been both genotyped and phenotyped are the key requisite when aiming for high prediction accuracy. With current genotyping projects producing genetic data for hundreds of thousands of patients, large-scale phenotyping has become the bottleneck in disease phenotype prediction.
Year: 2015 PMID: 26072497 PMCID: PMC4765855 DOI: 10.1093/bioinformatics/btv254
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Partitioning of the data and the proposed two-stage approach to co-training. The disease phenotype (diagnosis) is indicated as “pheno” and colored gold (true labels) or silver (imputed labels). Clinical covariates are shown as a generic real value (9.9). Genotypes are coded under an additive-effects model as {0, 1, 2}.
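The partitioning and two-stage idea in the Fig. 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the excerpt does not name the classifiers, so a least-squares linear scorer stands in for both stages, the data are synthetic, and the reading that set I carries true labels, set II receives imputed labels, and set III is held out for evaluation is an assumption drawn from the captions. All function names (`partition`, `fit_linear`, `score`, `two_stage`) are hypothetical.

```python
import numpy as np


def partition(n, frac_i=0.10, frac_ii=0.70, seed=0):
    """Shuffle n sample indices and split them into disjoint sets I, II, III."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_i = int(round(frac_i * n))
    n_ii = int(round(frac_ii * n))
    return perm[:n_i], perm[n_i:n_i + n_ii], perm[n_i + n_ii:]


def fit_linear(X, y):
    """Least-squares linear scorer with intercept (stand-in model, an assumption)."""
    Xb = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(Xb, np.asarray(y, dtype=float), rcond=None)
    return w


def score(w, X):
    """Real-valued scores of the linear scorer w on the rows of X."""
    return np.column_stack([X, np.ones(len(X))]) @ w


def two_stage(clinical, genotype, pheno, idx_i, idx_ii):
    # Stage 1: fit on clinical covariates of set I (true labels),
    # then impute binary phenotype labels for set II.
    w_clin = fit_linear(clinical[idx_i], pheno[idx_i])
    imputed = (score(w_clin, clinical[idx_ii]) >= 0.5).astype(int)
    # Stage 2: fit the genotype model h on set II using only imputed labels.
    w_geno = fit_linear(genotype[idx_ii], imputed)
    return w_geno, imputed


# Synthetic toy data, for illustration only.
rng = np.random.default_rng(1)
n = 200
pheno = rng.integers(0, 2, size=n)                    # binary diagnosis
clinical = pheno[:, None] + rng.normal(size=(n, 3))   # real-valued covariates
genotype = rng.integers(0, 3, size=(n, 50))           # additive coding {0, 1, 2}

idx_i, idx_ii, idx_iii = partition(n)
h, imputed = two_stage(clinical, genotype, pheno, idx_i, idx_ii)
test_scores = score(h, genotype[idx_iii])             # scores on held-out set III
```

The genotype model never sees true labels in this sketch, so set I can stay small while set II supplies most of the training samples.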
Bounds and prediction performance of in silico phenotyping. Partition of the data into: set I = 10%, set II = 70% and set III = 20%; 100 random folds
Fig. 2. ROC curves of bounds and prediction performance of in silico phenotyping. Partition of data into sets: I = 10%, II = 70% and III = 20%; 100 random folds
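The area under a ROC curve such as those in Fig. 2 can be computed from ranks, and per-fold values can then be summarized as a mean and standard deviation. The sketch below assumes the rank-sum (Mann-Whitney) formulation of AUC and a sample standard deviation (`ddof=1`). The excerpt does not say which estimator the authors used, and the helper names are hypothetical.

```python
import numpy as np


def average_ranks(x):
    """1-based ranks of x, with tied values sharing their mean rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    sorted_x = x[order]
    ranks = np.empty(len(x), dtype=float)
    start = 0
    while start < len(x):
        stop = start
        # Extend the run while the next sorted value ties with the current one.
        while stop + 1 < len(x) and sorted_x[stop + 1] == sorted_x[start]:
            stop += 1
        ranks[order[start:stop + 1]] = (start + stop) / 2.0 + 1.0
        start = stop + 1
    return ranks


def auc(labels, scores):
    """AUC as the normalized Mann-Whitney U statistic of the positive class."""
    labels = np.asarray(labels).astype(bool)
    n_pos = int(labels.sum())
    n_neg = int((~labels).sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")
    ranks = average_ranks(scores)
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def summarize(fold_aucs):
    """Mean and sample standard deviation of per-fold AUC values."""
    fold_aucs = np.asarray(fold_aucs, dtype=float)
    return fold_aucs.mean(), fold_aucs.std(ddof=1)
```

Calling `auc` once per random fold on the held-out set and passing the collected values to `summarize` gives a mean and spread of the kind reported in the tables.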
Varying sizes of I and its AUC scores when II is fixed (size of II = 1356 samples, 70% of the data; 100 random folds)
| Number of samples in I | Share of the data | AUC mean | AUC standard deviation |
|---|---|---|---|
| 193 | 10% of the data | 0.646 | 0.029 |
| 96 | 5% | 0.619 | 0.034 |
| 19 | 1% | 0.605 | 0.035 |
Varying sizes of II and its AUC scores when I is fixed (size of I = 193 samples, 10% of the data; 100 random folds)
| Number of samples in II | Share of the data | AUC mean | AUC standard deviation |
|---|---|---|---|
| 774 | 40% of the data | 0.597 | 0.038 |
| 969 | 50% | 0.604 | 0.035 |
| 1162 | 60% | 0.611 | 0.035 |
| 1356 | 70% | 0.646 | 0.029 |
Fig. 3. Delta of mean AUC (in silico phenotyping vs. lower bound) for varying sizes of I and II
Varying the number of selected features for the genotype classifier h (mean AUC μ and standard deviation σ). Row marked with asterisk (*) indicates the optimal value of k reported by the internal cross-validation
| Number of selected features (k) | AUC mean (μ) | AUC standard deviation (σ) |
|---|---|---|
| 200 | 0.624 | 0.031 |
| 400 | 0.631 | 0.032 |
| 800 | 0.638 | 0.031 |
| 1600 | 0.644 | 0.028 |
| 2000 | 0.646 | 0.029 |
| 3200 | 0.648 | 0.028 |
| 6400 | 0.651 | 0.030 |
| 12 800 | 0.650 | 0.027 |
| *25 600 | 0.648 | 0.029 |
| 51 200 | 0.643 | 0.026 |
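Selecting the top k features for the genotype classifier can be illustrated with a simple univariate ranking. The excerpt does not state which ranking criterion the authors used, so the absolute Pearson correlation with the label below is an assumption, and `top_k_features` is a hypothetical helper.

```python
import numpy as np


def top_k_features(X, y, k):
    """Indices of the k columns of X with the largest |Pearson correlation| to y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = (Xc * yc[:, None]).sum(axis=0)
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    corr = np.zeros(X.shape[1])
    valid = denom > 0                      # constant columns keep a score of 0
    corr[valid] = num[valid] / denom[valid]
    order = np.argsort(-np.abs(corr), kind="stable")
    return order[:k]
```

In an internal cross-validation, a function like this would be run for each candidate k on the training portion only, and the k with the best validation AUC would be kept.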