Sarah A Gagliano, Andrew D Paterson, Michael E Weale, Jo Knight.
Abstract
BACKGROUND: In silico models have recently been created to predict which genetic variants are more likely to contribute to the risk of a complex trait given their functional characteristics. However, there has been no comprehensive review of which predictive accuracy measures and data visualization techniques are most useful for assessing these models.
Year: 2015 PMID: 25997848 PMCID: PMC4440290 DOI: 10.1186/s12864-015-1616-z
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Predictive accuracy measures in the literature for models for prediction of variants associated with complex traits
| Study | Algorithm | Classifier | Area under ROC curve | Positive predictive value | Box plot | Histogram | Violin plot | Mann–Whitney U/Wilcoxon rank-sum test |
|---|---|---|---|---|---|---|---|---|
| Gagliano et al. 2014 | Modified Elastic net | GWAS hits vs. non-hits | x | x | x | |||
| Iversen et al. 2014 | Penalized logistic regression | GWAS hits vs. non-hits | x* | |||||
| Kircher et al. 2014 | Support Vector Machines | High-frequency human-derived alleles vs. simulated variants | x | x | x | |||
| Ritchie et al. 2014 | Modified Random Forest | HGMD hits vs. non-hits | x | x | x | |||
*reports “Concordance index”, which is equivalent to the area under the ROC curve
Predictive accuracy measures and the corresponding R package in which they can be computed
| Predictive accuracy measure | R package | Version |
|---|---|---|
| (1) The confusion matrix | ||
| Receiver operating characteristic (ROC) curve and area under the curve | prediction and performance in ROCR | 1.0-7 |
| Positive predictive value and negative predictive value | prediction and performance in ROCR: performance(prediction.object, "ppv") and performance(prediction.object, "npv") | 1.0-7 |
| (2) Visualization of the distribution of prediction values | ||
| Histograms of the prediction values separated by class | multhist in plotrix | 3.5-11 |
| Box plots | boxplot in graphics | Base package |
| Violin plots | vioplot in vioplot | |
| Quantile-quantile plots | qqplot in stats | Base package |
| (3) Statistical tests | ||
| Hypergeometric test | phyper in stats | Base package |
| Mann–Whitney | wilcox.test in stats | Base package |
| Asymptotic generalized Cochran-Mantel-Haenszel test | cmh_test in coin | 1.0-24 |
Descriptive statistics for the various genetic prediction models from Gagliano et al. (2014) to be used as examples here
| Analysis | Class | N | Minimum | 25th percentile | Median | Mean | 75th percentile | Maximum | Standard deviation | N outliers* |
|---|---|---|---|---|---|---|---|---|---|---|
| Phenotype-specific analyses | | | | | | | | | | |
| Brain-related | Hits | 144 | 0.40 | 0.42 | 0.51 | 0.51 | 0.57 | 0.77 | 0.09 | 3 |
| Brain-related | Non-hits | 32723 | 0.40 | 0.40 | 0.46 | 0.48 | 0.53 | 0.79 | 0.07 | 61 |
| Autoimmune | Hits | 234 | 0.29 | 0.45 | 0.55 | 0.55 | 0.66 | 0.86 | 0.14 | 0 |
| Autoimmune | Non-hits | 33266 | 0.29 | 0.30 | 0.44 | 0.45 | 0.55 | 0.93 | 0.13 | 0 |
| All phenotype analyses | | | | | | | | | | |
| p < 5E-8 | Hits | 1292 | 0.32 | 0.44 | 0.54 | 0.54 | 0.62 | 0.92 | 0.13 | 4 |
| p < 5E-8 | Non-hits | 30135 | 0.32 | 0.35 | 0.44 | 0.46 | 0.55 | 0.91 | 0.12 | 7 |
| All GWAS Catalogue | Hits | 3405 | 0.44 | 0.45 | 0.50 | 0.51 | 0.54 | 0.81 | 0.06 | 144 |
| All GWAS Catalogue | Non-hits | 30039 | 0.44 | 0.44 | 0.48 | 0.49 | 0.52 | 0.80 | 0.05 | 336 |
*Outliers are defined as data points more than 1.5 times the interquartile range below the 25th percentile or above the 75th percentile (interquartile range = 75th percentile - 25th percentile)
Fig. 1 A confusion matrix and its relation to predictive accuracy terms. TPR = True Positive Rate, TNR = True Negative Rate, PPV = Positive Predictive Value, NPV = Negative Predictive Value
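For reference, the four terms in Fig. 1 can be written in terms of the cells of the confusion matrix. The cell symbols TP, FP, TN and FN (true positives, false positives, true negatives, false negatives) are standard notation assumed here; they are not taken from the figure itself.

```latex
\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad
\mathrm{TNR} = \frac{TN}{TN + FP}, \qquad
\mathrm{PPV} = \frac{TP}{TP + FP}, \qquad
\mathrm{NPV} = \frac{TN}{TN + FN}
```

TPR and TNR condition on the true class, whereas PPV and NPV condition on the predicted class, so the latter two also depend on the proportion of hits among all variants.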
Fig. 2 ROC curves for the four models
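The area under a ROC curve can be read as a concordance: the probability that a randomly chosen hit receives a higher prediction value than a randomly chosen non-hit. The sketch below, in Python rather than the R packages listed above, computes the area both ways on made-up scores. The function names and the toy data are hypothetical and are not taken from the paper.

```python
def auc_by_concordance(hit_scores, nonhit_scores):
    """Fraction of (hit, non-hit) pairs in which the hit scores higher.

    Tied pairs count as half a concordant pair.
    """
    concordant = 0.0
    for h in hit_scores:
        for n in nonhit_scores:
            if h > n:
                concordant += 1.0
            elif h == n:
                concordant += 0.5
    return concordant / (len(hit_scores) * len(nonhit_scores))


def auc_by_trapezoid(hit_scores, nonhit_scores):
    """Area under the ROC curve by sweeping cut-offs from high to low."""
    cutoffs = sorted(set(hit_scores) | set(nonhit_scores), reverse=True)
    points = [(0.0, 0.0)]  # (false positive rate, true positive rate)
    for c in cutoffs:
        tpr = sum(s >= c for s in hit_scores) / len(hit_scores)
        fpr = sum(s >= c for s in nonhit_scores) / len(nonhit_scores)
        points.append((fpr, tpr))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid rule
    return area


# Hypothetical prediction values, for illustration only.
toy_hits = [0.9, 0.8, 0.55, 0.4]
toy_nonhits = [0.7, 0.5, 0.45, 0.3, 0.2]
print(auc_by_concordance(toy_hits, toy_nonhits))
print(auc_by_trapezoid(toy_hits, toy_nonhits))
```

The pairwise version is quadratic in the sample sizes, so for data sets as large as those in the tables above a rank-based or cut-off-based calculation would be the practical choice.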
Positive predictive and negative predictive values at various prediction value cut-offs for the two all phenotype analyses
| Prediction value cut-off | Positive predictive value: p < 5E-08 hits | Positive predictive value: all GWAS hits in catalogue | Negative predictive value: p < 5E-08 hits | Negative predictive value: all GWAS hits in catalogue |
|---|---|---|---|---|
| 0.5 | 0.069 | 0.128 | 0.968 | 0.915 |
| 0.6 | 0.094 | 0.226 | 0.956 | 0.903 |
| 0.7 | 0.198 | 0.304 | 0.948 | 0.899 |
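A minimal sketch of how positive and negative predictive values at a given cut-off could be computed from prediction values and hit labels. It assumes that a variant is called a predicted hit when its prediction value is greater than or equal to the cut-off; the table does not state whether the comparison is strict. The function name and the toy data are hypothetical.

```python
def ppv_npv_at_cutoff(scores, is_hit, cutoff):
    """Return (PPV, NPV) when scores >= cutoff are called predicted hits.

    A value of None is returned for a measure whose denominator is zero.
    """
    tp = fp = tn = fn = 0
    for score, hit in zip(scores, is_hit):
        predicted_hit = score >= cutoff  # assumed convention (not strict)
        if predicted_hit and hit:
            tp += 1
        elif predicted_hit and not hit:
            fp += 1
        elif not predicted_hit and hit:
            fn += 1
        else:
            tn += 1
    ppv = tp / (tp + fp) if (tp + fp) else None
    npv = tn / (tn + fn) if (tn + fn) else None
    return ppv, npv


# Hypothetical prediction values and labels, for illustration only.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.45, 0.4, 0.3, 0.2]
toy_is_hit = [True, True, False, True, False, False, False, True, False, False]
for cutoff in (0.5, 0.6, 0.7):
    print(cutoff, ppv_npv_at_cutoff(toy_scores, toy_is_hit, cutoff))
```

Because hits are a small fraction of all variants in the models above, the negative predictive value stays high at every cut-off, so the positive predictive value is usually the more informative of the two.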
Fig. 3 Histogram of prediction values for the all phenotype models with a bin size of 0.05. Compare to Fig. 4, which uses a bin size of 0.1. The bars show probability densities, so the total area of the black bars is one, as is the total area of the grey bars. The ideal plot would have two non-overlapping distributions, with the grey bars close to 0 and the black bars close to 1
Fig. 4 Histogram of prediction values for the all phenotype models with a bin size of 0.1. Compare to Fig. 3, which uses a bin size of 0.05. The bars show probability densities, so the total area of the black bars is one, as is the total area of the grey bars. The ideal plot would have two non-overlapping distributions, with the grey bars close to 0 and the black bars close to 1
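The density scaling described in the two captions can be sketched as follows: each class is binned separately and each count is divided by (class size × bin width), so the bar areas of a class sum to one whatever the bin size. This assumes prediction values lie between 0 and 1. The function name and the toy values are hypothetical.

```python
def density_histogram(values, n_bins):
    """Probability densities for equal-width bins covering [0, 1].

    n_bins = 20 corresponds to a bin size of 0.05, n_bins = 10 to 0.1.
    Values that fall exactly on a bin edge are subject to floating-point
    rounding when assigned to a bin.
    """
    width = 1.0 / n_bins
    counts = [0] * n_bins
    for v in values:
        index = min(int(v * n_bins), n_bins - 1)  # a value of 1.0 goes in the last bin
        counts[index] += 1
    return [c / (len(values) * width) for c in counts]


# Hypothetical prediction values for one class, for illustration only.
toy_values = [0.42, 0.51, 0.57, 0.63, 0.77]
for n_bins in (20, 10):
    densities = density_histogram(toy_values, n_bins)
    total_area = sum(d * (1.0 / n_bins) for d in densities)
    print(n_bins, total_area)
```

Scaling each class to unit area keeps the two distributions visually comparable even when non-hits outnumber hits many times over, as they do in the models summarised above.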
Fig. 5 Box and whisker plots for the four models. The line in the box is the median, and the box spans the 25th to 75th percentiles. Data points are shown individually as outliers if they lie more than 1.5 times the interquartile range (IQR) below the 25th percentile or above the 75th percentile. The lower and upper whiskers represent the 25th percentile minus 1.5*IQR and the 75th percentile plus 1.5*IQR, respectively. If the data do not extend as far as those calculated limits, the whisker is plotted at the minimum or maximum data point
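A sketch of the box plot quantities as the caption describes them. The percentiles use Python's `statistics.quantiles` with the "inclusive" method, which is an assumption: plotting software may compute quartiles slightly differently (R's box plot statistics, for example, are based on hinges and draw each whisker to the most extreme data point inside the limit), so results can differ in small samples. The function name and the toy values are hypothetical.

```python
import statistics


def box_plot_summary(values):
    """Median, quartiles, whisker positions and outliers for one group."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        # Whisker at the calculated limit, or at the data extreme if the
        # data do not reach that far (the rule stated in the caption).
        "lower_whisker": max(min(values), lower_limit),
        "upper_whisker": min(max(values), upper_limit),
        "outliers": [v for v in values if v < lower_limit or v > upper_limit],
    }


# Hypothetical prediction values, for illustration only.
toy_values = [0.40, 0.42, 0.45, 0.50, 0.51, 0.55, 0.57, 0.60, 0.95]
print(box_plot_summary(toy_values))
```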
Fig. 6 Violin plots of the four models
Fig. 7 Quantile-quantile plots for the four models
Mann–Whitney U p-values for the four models
| Model | Unaltered | n(hits) = n(non-hits) | No outliers (beyond 1.5 times the interquartile range from the 25th or 75th percentile) |
|---|---|---|---|
| Phenotype-specific analyses | |||
| Brain-related | 3.49E-06 | 0.007447 | 1.76E-05 |
| Autoimmune | 8.63E-28 | 5.26E-15 | 8.63E-28 |
| All phenotype analyses | |||
| p < 5E-8 | 2.08E-93 | 3.01E-52 | 3.53E-92 |
| All Catalogue | 7.17E-50 | 7.26E-27 | 1.37E-34 |
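The comparison behind this table can be sketched as follows: a Mann–Whitney U test on hits versus non-hits, and the same test after reducing the non-hits to the number of hits. The sketch uses a normal approximation with a tie correction and no continuity correction, so it is not expected to reproduce the output of R's `wilcox.test` exactly. Random down-sampling of the non-hits is an assumption, since the table does not say how the equal-sized sets were drawn. All function names and data are hypothetical.

```python
import math
import random
from collections import Counter


def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U for x, two-sided p-value from a normal approximation)."""
    pooled = list(x) + list(y)
    n1, n2, n = len(x), len(y), len(pooled)
    rank_sum_x = sum(midranks(pooled)[:n1])
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    variance = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:  # every value identical: no evidence of a shift
        return u, 1.0
    z = (u - n1 * n2 / 2.0) / math.sqrt(variance)
    return u, math.erfc(abs(z) / math.sqrt(2.0))


def balanced_nonhits(nonhits, n_hits, seed=0):
    """Randomly down-sample non-hits to n_hits values (assumed procedure)."""
    return random.Random(seed).sample(list(nonhits), n_hits)


# Hypothetical prediction values, for illustration only.
toy_hits = [0.9, 0.8, 0.55, 0.4]
toy_nonhits = [0.7, 0.5, 0.45, 0.3, 0.2]
print(mann_whitney_u(toy_hits, toy_nonhits))
print(mann_whitney_u(toy_hits, balanced_nonhits(toy_nonhits, len(toy_hits))))
```

Dividing U by the product of the two sample sizes gives the same concordance that the area under the ROC curve measures. The p-value additionally depends on the sample sizes, which is consistent with the less extreme values in the equal-sized column of the table.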
Fig. 8 Ranked Mann–Whitney U p-values plotted separately for the hits and non-hits. The non-hits follow a uniform distribution, whereas the hits do not. The same pattern was observed for all four models