Elizabeth Held, Joshua Cape, Nathan Tintle.
Abstract
Machine learning methods continue to show promise in the analysis of data from genetic association studies because of the high number of variables relative to the number of observations. However, few best practices exist for the application of these methods. We extend a recently proposed supervised machine learning approach for predicting disease risk from genotypes so that it can incorporate gene expression data and rare variants. We then apply 2 versions of the approach (radial and linear support vector machines) to simulated data from Genetic Analysis Workshop 19 and compare their performance to that of logistic regression. Performance was not radically different across the 3 methods, although the linear support vector machine tended to show small gains in predictive ability relative to the radial support vector machine and logistic regression. Importantly, as the number of genes in the models increased, even when those genes contained causal rare variants, predictive ability decreased by a statistically significant amount for both the radial support vector machine and logistic regression. The linear support vector machine was more robust to the inclusion of additional genes. Further work is needed to evaluate machine learning approaches on larger samples and to evaluate the relative improvement in model prediction from the incorporation of gene expression data.
Year: 2016 PMID: 27980626 PMCID: PMC5133520 DOI: 10.1186/s12919-016-0020-2
Source DB: PubMed Journal: BMC Proc ISSN: 1753-6561
Fig. 1 Overall performance (AUC) of the 3 classification approaches across 315 different situations. All 3 methods performed fairly similarly on AUC overall. Linear SVM tended to slightly outperform both other methods across the 315 different settings investigated.
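As a reminder of what the AUC measures, the following is a minimal sketch that computes it as the fraction of case-control pairs in which the case receives the higher predicted risk (the Mann-Whitney formulation). The function name `auc` and the toy labels and scores are illustrative assumptions; they are not taken from the study, and the authors' own software is not specified here.

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney pair-counting formulation.

    labels: iterable of 0/1 values (1 = case, 0 = control).
    scores: predicted risk for each subject (higher = more likely a case).
    A tie between a case and a control counts as half a concordant pair.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    concordant = 0.0
    for c in cases:
        for d in controls:
            if c > d:
                concordant += 1.0
            elif c == d:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))


# Toy data (hypothetical, not from the study).
y = [1, 1, 1, 0, 0, 0]
risk = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
print(auc(y, risk))  # 8 of the 9 case-control pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ordering of cases and controls, and 1.0 to perfect separation.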
Regression analysis summarizing association of model parameters and model performance across 315 different situations
| Model parameters | LR | Radial SVM | Linear SVM |
|---|---|---|---|
| Gene expression noise | −3.5 × 10⁻² (1.2 × 10⁻²)** | −6.4 × 10⁻² (2.0 × 10⁻²)** | −3.3 × 10⁻² (1.7 × 10⁻²) |
| Number of collapsed phenotypes | −4 × 10⁻⁵ (1.1 × 10⁻⁵)*** | −5.49 × 10⁻⁶ (1.7 × 10⁻⁵) | −4.4 × 10⁻⁵ (1.5 × 10⁻⁵)** |
| Number of causal genes | −7.5 × 10⁻⁴ (1.7 × 10⁻⁴)*** | −8.2 × 10⁻⁴ (2.7 × 10⁻⁴)** | −1.6 × 10⁻⁴ (2.3 × 10⁻⁴) |
| Number of random genes | −1.7 × 10⁻³ (3.6 × 10⁻⁵)*** | −9.8 × 10⁻⁴ (5.6 × 10⁻⁵)*** | −1.0 × 10⁻³ (4.8 × 10⁻⁵)*** |
| Model r² | 88.4% | 50.7% | 62.5% |
Each cell gives the estimated coefficient in the regression model, followed in parentheses by SE, the estimated standard error of the coefficient.
Regression models predicted AUC by 4 different model parameters for each of the 3 methods separately
Statistical significance of the estimated regression coefficients is indicated by asterisks (***p < 0.001, **p < 0.01).
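The kind of summary regression described in these notes (AUC regressed on model parameters, reported as coefficient, standard error, and r²) can be sketched with ordinary least squares. This is only an illustrative sketch under stated assumptions: the helper names `invert` and `ols` are hypothetical, the fit uses plain normal equations, only 2 of the 4 parameters are used, and the settings and AUC values below are made up for demonstration rather than taken from the 315 situations in the study.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bp*xp.

    Returns (coefficients, standard_errors, r_squared); the intercept comes first.
    """
    rows = [[1.0] + list(x) for x in X]          # prepend the intercept column
    n, p = len(rows), len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    xtx_inv = invert(xtx)
    beta = [sum(xtx_inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    mean_y = sum(y) / n
    tss = sum((yi - mean_y) ** 2 for yi in y)
    sigma2 = rss / (n - p)                       # residual variance estimate
    se = [(sigma2 * xtx_inv[i][i]) ** 0.5 for i in range(p)]
    return beta, se, 1.0 - rss / tss


# Hypothetical settings: (number of causal genes, number of random genes) -> AUC.
# These values are invented for illustration only.
settings = [(1, 0), (5, 0), (10, 0), (1, 50), (5, 50), (10, 50), (1, 100), (5, 100)]
aucs = [0.78, 0.77, 0.76, 0.70, 0.69, 0.68, 0.62, 0.61]

beta, se, r2 = ols(settings, aucs)
for name, b, s in zip(["intercept", "causal genes", "random genes"], beta, se):
    print(f"{name}: {b:.2e} ({s:.2e})")
print(f"r2 = {r2:.3f}")
```

A negative coefficient for a parameter means that, in the fitted model, AUC falls as that parameter increases, which is how the entries in the table are read.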
Comparing model AUC with and without gene expression data
| Number of causal genes | LR, without gene exp. | LR, with gene exp. | Radial SVM, without gene exp. | Radial SVM, with gene exp. | Linear SVM, without gene exp. | Linear SVM, with gene exp. |
|---|---|---|---|---|---|---|
| 1 | 0.775 | 0.772 | 0.764 | 0.764 | 0.767 | 0.772 |
| 5 | 0.772 | 0.773 | 0.755 | 0.760 | 0.759 | 0.770 |
| 10 | 0.785 | 0.778 | 0.739 | 0.755 | 0.776 | 0.770 |
Model AUC is reported in the table for SIMPHEN.197, k = 0 (when expression data was included) and m = 5. The table shows that the inclusion of gene expression data had little to no impact on the AUC in this data set.
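The size of the effect can be checked directly from the table by averaging, for each method, the change in AUC when gene expression data is added. The short sketch below does this arithmetic; the AUC values are copied from the table, while the variable and function names are illustrative.

```python
# AUC values copied from the table: (without, with) gene expression data,
# listed for 1, 5, and 10 causal genes.
auc_table = {
    "LR": [(0.775, 0.772), (0.772, 0.773), (0.785, 0.778)],
    "Radial SVM": [(0.764, 0.764), (0.755, 0.760), (0.739, 0.755)],
    "Linear SVM": [(0.767, 0.772), (0.759, 0.770), (0.776, 0.770)],
}


def mean_change(pairs):
    """Mean of (AUC with expression data) minus (AUC without expression data)."""
    return sum(with_exp - without for without, with_exp in pairs) / len(pairs)


changes = {method: round(mean_change(pairs), 4) for method, pairs in auc_table.items()}
for method, change in changes.items():
    print(f"{method}: mean AUC change = {change:+.4f}")
```

For all three methods the mean change is below 0.01 in absolute value, and its sign is not consistent across methods or numbers of causal genes.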