Shefali S Verma, Anastasia Lucas, Xinyuan Zhang, Yogasudha Veturi, Scott Dudek, Binglan Li, Ruowang Li, Ryan Urbanowicz, Jason H Moore, Dokyoon Kim, Marylyn D Ritchie.
Abstract
BACKGROUND: Machine learning methods have gained popularity and practicality in identifying linear and non-linear effects of variants associated with complex diseases and traits. Detecting epistatic interactions remains a challenge because the input typically has a large number of features and a relatively small sample size, the so-called "short fat data" problem. The efficiency of machine learning methods can be increased by limiting the number of input features, so it is important to perform variable selection before searching for epistasis. Many feature selection methods have been proposed and evaluated, but no single method works best in all scenarios. We demonstrate this by conducting two separate simulation analyses to evaluate the proposed collective feature selection approach.
Keywords: Epistasis; Feature selection; Non-additive effects; Non-parametric methods; Obesity; Parametric methods
Year: 2018 PMID: 29713383 PMCID: PMC5907720 DOI: 10.1186/s13040-018-0168-6
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Parameters used for generating simulated experiment 1 data
| Type of effect (100 and 500 SNPs) | Dataset name | Causal SNP model (penetrance: 0.1, 0.5 and 0.9) |
|---|---|---|
| Main effect | 1SNP | G1 |
| Main effect | 2SNP | G1, G2 |
| Main effect | 3SNP | G1, G2, G3 |
| Main effect | 4SNP | G1, G2, G3, G4 |
| Interaction effect | case1_control0 | G1 <-> G2 |
| Interaction effect | case1_control1 | G1 <-> G2 |
| Interaction effect | case2_control0 | G1 <-> G2 |
| Interaction effect | case2_control2 | G1 <-> G2 |
All datasets consisted of 2000 cases and 2000 controls (4000 samples in total). "G" is the SNP ID prefix
Dummy example representing the simulation criteria for main effects in simulation experiment #1
| Case control status | SNP1 |
|---|---|
| 0 | 1 |
| 0 | 2 |
| 0 | 2 |
| 0 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 1 |
In column 1, 0 refers to controls and 1 refers to cases. In column 2, the values 0, 1 and 2 are genotypes
Dummy example representing the simulation criteria for interacting effects in simulation experiment #1
| Case control status | SNP1 | SNP2 |
|---|---|---|
| 0 | 2 | 2 |
| 0 | 2 | 2 |
| 0 | 2 | 2 |
| 0 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |
In column 1, 0 refers to controls and 1 refers to cases. In columns 2 and 3, the values 0, 1 and 2 are genotypes
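A toy table like the one above can be summarized by counting how often each genotype combination occurs among cases and among controls. The following is a minimal sketch, not the authors' simulation code; the function name `joint_genotype_counts` is hypothetical and the rows are copied from the dummy interaction table.

```python
from collections import Counter

# Rows copied from the dummy interaction table: (case/control status, SNP1, SNP2).
rows = [
    (0, 2, 2), (0, 2, 2), (0, 2, 2), (0, 0, 0),
    (1, 0, 0), (1, 0, 0), (1, 0, 0), (1, 1, 1),
]

def joint_genotype_counts(rows):
    """Count samples per (status, SNP1 genotype, SNP2 genotype) combination."""
    return Counter(rows)

counts = joint_genotype_counts(rows)
for (status, g1, g2), n in sorted(counts.items()):
    label = "case" if status == 1 else "control"
    print(f"{label:7s} SNP1={g1} SNP2={g2}: {n}")
```

Counting joint genotypes rather than each SNP separately keeps the pairing between SNP1 and SNP2 visible, which is the pattern the interaction example is meant to illustrate.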
Fig. 1 Methods explored for feature selection and selection of the top user-defined percentage of features for comparison
Fig. 2 The outer circle represents a collective feature selection approach
Fig. 3 Pipeline of the feature selection procedure and downstream analysis in both simulated and natural biological data
Fig. 4 Comparison of results for TuRF parameters when used with MultiSURF*, and for MultiSURF* without TuRF. The left plot is for 100 variables and the right plot is for 500 variables. The x-axis lists the main effect and interaction datasets and the y-axis lists all methods tested. The plots are faceted by percentage of top variables selected and strength of signal. The color gradient shows sensitivity (percentage of true positives), ranging from blue to orange, or 0 to 1
Fig. 5 Distribution of median training accuracy from pMDR analyses. The x-axis shows the median training accuracy of the model and the y-axis lists all main effect and interaction simulated datasets. The plots are faceted by effect size and top percentage of models selected. The 100 SNP data are shown as circles and the 500 SNP data as triangles. The two colors mark true and false positives in the results
Fig. 6 Comparison of results from all methods tested on simulated dataset 1. The heat maps show the sensitivity of all methods (y-axis) on all simulated models (x-axis) for both the 100 SNP and 500 SNP datasets, in combination with different effect sizes and selection percentages of top features
Fig. 7 Comparison of results from all methods tested on simulated dataset 2. The heat maps show the sensitivity of all methods (y-axis) on both simulated models (x-axis), in combination with different effect sizes (heritability values of 0.1, 0.2 and 0.4) and selection percentages of top features
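The sensitivity shown in these heat maps can be sketched as follows, assuming it is the fraction of simulated causal SNPs that land in the top user-defined percentage of ranked features. The helper names `top_fraction` and `sensitivity` and the importance scores are hypothetical and are not taken from the paper.

```python
import math

def top_fraction(scores, fraction):
    """Return the names of the top `fraction` of features by score (highest first)."""
    k = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])

def sensitivity(selected, causal):
    """Fraction of causal features that appear in the selected set (0 to 1)."""
    causal = set(causal)
    if not causal:
        raise ValueError("causal set must not be empty")
    return len(selected & causal) / len(causal)

# Hypothetical importance scores for 10 SNPs; G1 and G2 play the simulated causal pair.
values = [0.90, 0.10, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.05, 0.02]
scores = {f"G{i}": s for i, s in enumerate(values, start=1)}

selected = top_fraction(scores, 0.20)   # top 20% of 10 features -> 2 features
result = sensitivity(selected, {"G1", "G2"})
print(sorted(selected), result)
```

In this toy ranking only one of the two causal SNPs reaches the top 20%, so the sensitivity is 0.5; widening the selected percentage trades a larger feature set for a better chance of retaining weakly ranked causal variants.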
Fig. 8 Time in seconds (y-axis) taken to run each feature selection method. All simulated models are shown on the x-axis. Color represents the method. Circles are for 100 SNP datasets and triangles for 500 SNP datasets
Computational time and memory requirements for all feature selection methods, compared in terms of number of SNPs
| Method | Time (s), 100 SNPs | Time (s), 500 SNPs | Time (s), 50,000 SNPs | Time (s), 100,000 SNPs | Memory, 100 SNPs | Memory, 500 SNPs | Memory, 50,000 SNPs | Memory, 100,000 SNPs |
|---|---|---|---|---|---|---|---|---|
| LASSO | 4.65 | 58.49 | NA | NA | 10 GB | 10 GB | NA | NA |
| LASSO with interactions | 186.9 | 1800 | NA | NA | 20 GB | 20 GB | NA | NA |
| Elastic Net | 31.44 | 401.11 | NA | NA | 10 GB | 10 GB | NA | NA |
| Ranger | 151.31 | 681.83 | NA | NA | 8 GB | 8 GB | NA | NA |
| Gradient Boosting | 103.22 | 466.95 | NA | NA | 8 GB | 8 GB | NA | NA |
| MDR | 0.25 | 15.03 | 6102 | 89,777 | 1 GB | 1 GB | 10 GB | 30 GB |
| MultiSURF | 2.48 | 5.13 | NA | NA | 18 GB | 39 GB | NA | NA |
| MultiSURF + TuRF 0.05 | 36.72 | 65.72 | 4420 | 8321 | 18 GB | 39 GB | 28 GB | 28 GB |
"NA" marks cases where the method could not be tested because it was computationally infeasible when all parameters were kept the same as for the simulated datasets
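To compare how the methods scale, the runtimes above can be turned into a fold increase from 100 to 500 SNPs. This is a small arithmetic sketch over the values reported in the table; the function name `fold_increase` is hypothetical.

```python
# Runtimes in seconds copied from the table above: method -> (100 SNPs, 500 SNPs).
runtime_s = {
    "LASSO": (4.65, 58.49),
    "LASSO with interactions": (186.9, 1800),
    "Elastic Net": (31.44, 401.11),
    "Ranger": (151.31, 681.83),
    "Gradient Boosting": (103.22, 466.95),
    "MDR": (0.25, 15.03),
    "MultiSURF": (2.48, 5.13),
    "MultiSURF + TuRF 0.05": (36.72, 65.72),
}

def fold_increase(times):
    """Ratio of runtime at 500 SNPs to runtime at 100 SNPs for each method."""
    return {method: t500 / t100 for method, (t100, t500) in times.items()}

ratios = fold_increase(runtime_s)
for method, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{method:25s} {ratio:6.1f}x")
```

The ratio only describes the step between the two smallest dataset sizes; it says nothing about absolute cost, where MDR remains the fastest method at both 100 and 500 SNPs in the table.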
Fig. 9 Venn diagrams show the overlap among the top features selected by all methods; the bar charts below each Venn diagram show the numbers of true positives and false positives selected by each method. Plot (a) shows results for the EDM-1 datasets and plot (b) shows results for the EDM-2 datasets
Fig. 10 Number of features selected by the collective approach
Fig. 11 Collective feature selection to select 1758 variables with potential epistatic effects from MyCode data
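A collective selection step of this kind can be sketched as below, under the assumption that the collective set is the union of the top features chosen by each method, with a record of which methods chose each feature. The function name `collective_selection` and the SNP identifiers are hypothetical placeholders, not values from the study.

```python
def collective_selection(selections):
    """Map each selected feature to the set of methods that selected it.

    The keys of the returned dict form the union of all per-method selections.
    """
    support = {}
    for method, features in selections.items():
        for feature in features:
            support.setdefault(feature, set()).add(method)
    return support

# Hypothetical top-feature sets per method; SNP identifiers are placeholders.
selections = {
    "MultiSURF + TuRF": {"snp_a", "snp_b", "snp_c"},
    "Gradient Boosting": {"snp_b", "snp_d"},
    "Elastic Net": {"snp_b", "snp_c", "snp_e"},
}

support = collective_selection(selections)
collective = set(support)                      # union across methods
shared_by_all = {f for f, methods in support.items()
                 if len(methods) == len(selections)}
print(sorted(collective), sorted(shared_by_all))
```

Keeping the per-feature provenance makes it possible to report both the overlap between methods and a presence/absence view of which method contributed each variant.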
Training and testing AUC for models selected by ATHENA
| Cross Validation | Training AUC | Testing AUC |
|---|---|---|
| CV1 | 0.552115 | 0.537071 |
| CV2 | 0.546601 | 0.540414 |
| CV3 | 0.543943 | 0.547598 |
| CV4 | 0.549398 | 0.538175 |
| CV5 | 0.555795 | 0.541373 |
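The per-fold values above can be summarized with a mean and a training-minus-testing gap per fold. This is a minimal arithmetic sketch over the table values; the variable names are hypothetical.

```python
from statistics import mean

# (training AUC, testing AUC) per cross-validation fold, copied from the table above.
auc = {
    "CV1": (0.552115, 0.537071),
    "CV2": (0.546601, 0.540414),
    "CV3": (0.543943, 0.547598),
    "CV4": (0.549398, 0.538175),
    "CV5": (0.555795, 0.541373),
}

train_mean = mean(train for train, _ in auc.values())
test_mean = mean(test for _, test in auc.values())

# Positive gap: the model scored higher on training data than on held-out data.
gaps = {fold: train - test for fold, (train, test) in auc.items()}

print(f"mean training AUC: {train_mean:.4f}")
print(f"mean testing AUC:  {test_mean:.4f}")
for fold, gap in gaps.items():
    print(f"{fold}: gap = {gap:+.6f}")
```

The means differ by less than 0.01, and in one fold (CV3) the testing AUC is slightly higher than the training AUC.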
Fig. 12 Best GENN model selected by ATHENA. The SNPs are annotated with gene names. The table at the bottom right lists the variants and genes in the model; its cells are colored to show whether each feature selection method selected the variant (orange) or did not (white)
P-values and betas from regression analyses on 5 SNPs in the network selected by ATHENA
| SNP | Gene name | p-value (no covariates) | Beta (no covariates) | p-value (with covariates) | Beta (with covariates) |
|---|---|---|---|---|---|
| exm-rs11075987 | | 6.76E-11 | 0.173939 | 8.02E-09 | 0.1690 |
| rs7232886 | | 9.36E-09 | −0.15448 | 6.24E-06 | −0.1336 |
| rs2756184 | | 0.014104 | −0.06635 | 0.018812 | −0.0699 |
| rs9520911 | | 0.045223 | 0.053358 | 0.051185 | 0.0573 |
| rs7043308 | | 0.768626 | −0.00958 | 0.745697 | −0.0116 |