Emily R Holzinger, Silke Szymczak, James Malley, Elizabeth W Pugh, Hua Ling, Sean Griffith, Peng Zhang, Qing Li, Cheryl D Cropp, Joan E Bailey-Wilson.
Abstract
Current findings from genetic studies of complex human traits often do not explain a large proportion of the estimated variation of these traits due to genetic factors. This could be, in part, due to overly stringent significance thresholds in traditional statistical methods, such as linear and logistic regression. Machine learning methods, such as Random Forests (RF), are an alternative approach to identify potentially interesting variants. One major issue with these methods is that there is no clear way to distinguish between probable true hits and noise variables based on the importance metric calculated. To this end, we are developing a method called the Relative Recurrency Variable Importance Metric (r2VIM), a RF-based variable selection method. Here, we apply r2VIM to the unrelated Genetic Analysis Workshop 19 data with simulated systolic blood pressure as the phenotype. We compare the number of "true" functional variants identified by r2VIM with those identified by linear regression analyses that use a Bonferroni correction to calculate a significance threshold. Our results show that r2VIM performed comparably to linear regression. Our findings are proof-of-concept for r2VIM, as it identifies a similar number of functional and nonfunctional variants as a more commonly used technique when the optimal importance score threshold is used.
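The Bonferroni correction mentioned in the abstract divides the desired family-wise error rate by the number of tests performed. The following is a minimal sketch of that calculation, not the authors' code; the function names and the example p values are hypothetical and are not taken from the Genetic Analysis Workshop 19 data.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold for a family-wise error rate of alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def select_significant(p_values, alpha=0.05):
    """Return the indices of tests whose p value falls below the Bonferroni threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [i for i, p in enumerate(p_values) if p < threshold]


# Hypothetical p values for four variants (illustration only).
example_p = [0.2, 0.001, 0.04, 0.5]
selected = select_significant(example_p)
print(selected)
```

With four tests the per-test threshold is 0.05 / 4 = 0.0125, so only the second variant (index 1) is selected. Because the threshold shrinks in proportion to the number of variants tested, it becomes very stringent in genome-wide analyses.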
Year: 2016 PMID: 27980627 PMCID: PMC5133476 DOI: 10.1186/s12919-016-0021-1
Source DB: PubMed Journal: BMC Proc ISSN: 1753-6561
Fig. 1 Results for the linear regression analysis of the simulated SBP phenotype for 2 p value thresholds (p < 0.05 and p < 5 × 10⁻⁷). The x-axis represents the variant index, which is in order of genome location. The y-axis shows the −log10(p value). The variants in red indicate functional variants (left) and variants in functional genes (right)
Counts for the number of variants selected at different thresholds for linear regression (p value) and r2VIM (min.RIS)
| Method | Threshold | All (~350 k) | Func. vars. (1047) | Func. genes (4328) |
|---|---|---|---|---|
| Linear Regression | | 7739 | 35 | 136 |
| | | 59 | 6 | 9 |
| r2VIM | min.RIS > 0 | 340 | 6 | 9 |
| | min.RIS > 0.5 | 37 | 5 | 8 |
| | min.RIS > 1 | 25 | 5 | 8 |
The total number selected, the number of variants simulated directly to be functional, and the number of variants in the functional genes are shown
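To make the rows easier to compare, the counts in the table can be converted into the fraction of selected variants that are functional or lie in functional genes. The sketch below only does arithmetic on the counts shown above. The row labels for linear regression are placeholders, because the p value thresholds for those rows are not given in the table.

```python
# Counts copied from the table:
# (total selected, functional variants, variants in functional genes)
counts = {
    "Linear Regression, row 1": (7739, 35, 136),
    "Linear Regression, row 2": (59, 6, 9),
    "r2VIM, min.RIS > 0": (340, 6, 9),
    "r2VIM, min.RIS > 0.5": (37, 5, 8),
    "r2VIM, min.RIS > 1": (25, 5, 8),
}

# Fraction of the selected variants that fall in each functional category.
summary = {
    label: (func_vars / total, func_genes / total)
    for label, (total, func_vars, func_genes) in counts.items()
}

for label, (frac_vars, frac_genes) in summary.items():
    print(f"{label}: {frac_vars:.1%} functional variants, "
          f"{frac_genes:.1%} in functional genes")
```

For example, at min.RIS > 1, 5 of the 25 selected variants (20%) are functional variants, compared with 6 of 59 (about 10%) for the second linear regression row.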
Fig. 2 Results for the r2VIM analysis of the simulated SBP phenotype for 3 min.RIS thresholds (min.RIS > 0, min.RIS > 0.5, and min.RIS > 1). The x-axis represents the variant index, which is in order of genome location. The y-axis shows the min.RIS. The variants in red indicate functional variants (left) and variants in functional genes (right)
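Selection on a min.RIS threshold can be sketched as follows. The text here does not define min.RIS, so this sketch rests on two assumptions: that min.RIS is the minimum, over repeated RF runs, of a relative importance score, and that each run's importance scores are scaled by the absolute value of that run's most negative score. All function names and the example importance values are hypothetical.

```python
def relative_importance(run_scores):
    """Scale one run's importance scores by the absolute value of its lowest score.

    Assumption: the lowest score in a run is negative and reflects noise.
    """
    lowest = min(run_scores)
    if lowest >= 0:
        raise ValueError("scaling needs at least one negative importance score")
    scale = abs(lowest)
    return [score / scale for score in run_scores]


def min_ris(importance_by_run):
    """Minimum relative importance of each variable across all runs."""
    scaled = [relative_importance(run) for run in importance_by_run]
    # zip(*scaled) groups the scores of each variable across runs.
    return [min(per_variable) for per_variable in zip(*scaled)]


def select(importance_by_run, threshold):
    """Indices of variables whose min.RIS exceeds the threshold."""
    return [i for i, v in enumerate(min_ris(importance_by_run)) if v > threshold]


# Hypothetical importance scores: 2 runs (rows) by 4 variables (columns).
runs = [
    [4.0, 0.5, -1.0, 2.0],
    [6.0, -2.0, 1.0, 3.0],
]
print(min_ris(runs))
print(select(runs, 1))
```

In this toy input, only the first and fourth variables stay above a threshold of 1 in both runs, so only they are selected. Taking the minimum across runs means a variable must score well in every run rather than in just one.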
Fig. 3 Comparison of the min.RIS score from the r2VIM analysis (y-axis) and the −log10(p value) from the linear regression analysis (x-axis) of the simulated SBP phenotype. The variants in red indicate functional variants (top) and variants in functional genes (bottom)
Fig. 4 Results for the r2VIM analysis of the simulated Q1 phenotype. The x-axis represents the variant index, which is in order of genome location. The y-axis shows the min.RIS. The variants in red indicate functional variants (left) and variants in functional genes (right) from the SBP phenotype simulated model