Damian Gola, Jestinah M Mahachie John, Kristel van Steen, Inke R König.
Abstract
Complex diseases are, by definition, determined by multiple genetic and environmental factors, alone as well as in interaction. To analyze interactions in genetic data, many statistical methods have been suggested, most of them relying on statistical regression models. Given the known limitations of classical methods, approaches from the machine-learning community have also become attractive. From this latter family, a fast-growing collection of methods has emerged that is based on the Multifactor Dimensionality Reduction (MDR) approach. Since its first introduction, MDR has enjoyed great popularity in applications and has been extended and modified multiple times. Based on a literature search, we here provide a systematic and comprehensive overview of these suggested methods. The methods are described in detail, and the availability of implementations is listed. The most recent approaches can deal with large-scale data sets and rare variants, which is why we expect these methods to gain further in popularity.
Keywords: data mining; epistasis; interaction; machine learning; multifactor dimensionality reduction
Year: 2015 PMID: 26108231 PMCID: PMC4793893 DOI: 10.1093/bib/bbv038
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Figure 1. Roadmap of Multifactor Dimensionality Reduction (MDR) showing the temporal development of MDR and MDR-based approaches. Abbreviations and further explanations are provided in the text and tables.
Figure 2. Flow diagram depicting details of the literature search. Database search 1: 6 February 2014 in PubMed (www.ncbi.nlm.nih.gov/pubmed) for [(‘multifactor dimensionality reduction’ OR ‘MDR’) AND genetic AND interaction], limited to Humans; Database search 2: 7 February 2014 in PubMed (www.ncbi.nlm.nih.gov/pubmed) for [‘multifactor dimensionality reduction’ genetic], limited to Humans; Database search 3: 24 February 2014 in Google Scholar (scholar.google.de/) for [‘multifactor dimensionality reduction’ genetic].
Overview of named MDR-based methods
| Name | Description | Data structure | Cov | Pheno | Small sample sizes | Applications |
|---|---|---|---|---|---|---|
| Multifactor Dimensionality Reduction (MDR) [ | Reduce dimensionality of multi-locus information by pooling multi-locus genotypes into high-risk and low-risk groups | U | No/yes, depends on implementation (see | D | No | Numerous phenotypes, see refs. [ |
| Generalized MDR (GMDR) [ | Flexible framework by using GLMs | U | Yes | D, Q | No | Numerous phenotypes, see refs. [ |
| Pedigree-based GMDR (PGMDR) [ | Transformation of family data into matched case-control data | F | Yes | D, Q | No | Nicotine dependence [ |
| Support-Vector-Machine-based PGMDR (SVM-PGMDR) [ | Use of SVMs instead of GLMs | F | Yes | D, Q | Yes | Alcohol dependence [ |
| Unified GMDR (UGMDR) [ | Simultaneous handling of families and unrelateds | U and F | Yes | D, Q | No | Nicotine dependence [ |
| Cox-based MDR (Cox-MDR) [ | Transformation of survival time into dichotomous attribute using martingale residuals | U | Yes | S | No | Leukemia [ |
| Multivariate GMDR (MV-GMDR) [ | Multivariate modeling using generalized estimating equations | U | Yes | D, Q, MV | No | Blood pressure [ |
| Robust MDR (RMDR) [ | Handling of sparse/empty cells using ‘unknown risk’ class | U | No | D | Yes | Bladder cancer [ |
| Log-linear-based MDR (LM-MDR) [ | Improved factor combination by log-linear models and re-classification of risk | U | No | D | Yes | Alzheimer's disease [ |
| Odds-ratio-based MDR (OR-MDR) [ | OR instead of naïve Bayes classifier to classify its risk | U | No | D | Yes | Chronic Fatigue Syndrome [ |
| Optimal MDR (Opt-MDR) [ | Data driven instead of fixed threshold; | U | No | D | No | |
| MDR for Stratified Populations (MDR-SP) [ | Accounting for population stratification by using principal components; significance estimation by generalized EVD | U | No | D | No | |
| Pair-wise MDR (PW-MDR) [ | Handling of sparse/empty cells by reducing contingency tables to all possible two-dimensional interactions | U | No | D | Yes | Kidney transplant [ |
| Extended MDR (EMDR) [ | Evaluation of final model by | U | No | D | No | |
| Survival Dimensionality Reduction (SDR) [ | Classification based on differences between cell and whole population survival estimates; IBS to evaluate models | U | No | S | No | Rheumatoid arthritis [ |
| Survival MDR (Surv-MDR) [ | Log-rank test to classify cells; squared log-rank statistic to evaluate models | U | No | S | No | Bladder cancer [ |
| Quantitative MDR (QMDR) [ | Handling of quantitative phenotypes by comparing cell with overall mean; | U | No | Q | No | Renal and Vascular End-Stage Disease [ |
| Ordinal MDR (Ord-MDR) [ | Handling of phenotypes with >2 classes by assigning each cell to most likely phenotypic class | U | No | O | No | Obesity [ |
| MDR with Pedigree Disequilibrium Test (MDR-PDT) [ | Handling of extended pedigrees using pedigree disequilibrium test | F | No | D | No | Alzheimer’s disease [ |
| MDR with Phenomic Analysis (MDR-Phenomics) [ | Handling of trios by comparing number of times genotype is transmitted versus not transmitted to affected child; analysis of variance model to assess effect of PC | F | No | D | No | Autism [ |
| Aggregated MDR (A-MDR) [ | Defining significant models using threshold maximizing area under ROC curve; aggregated risk score based on all significant models | U | No | D | No | Juvenile idiopathic arthritis [ |
| Model-based MDR (MB-MDR) [ | Test of each cell versus all others using association test statistic; association test statistic comparing pooled high-risk and pooled low-risk cells to evaluate models | U | No | D, Q, S | No | Bladder cancer [ |
Cov = Covariate adjustment possible, Pheno = Possible phenotypes with D = Dichotomous, Q = Quantitative, S = Survival, MV = Multivariate, O = Ordinal.
Data structures: F = Family based, U = Unrelated samples.
Small sample sizes: In principle, all MDR-based methods are designed for small sample sizes, but some methods provide special approaches to deal with sparse or empty cells, which typically arise when analyzing very small sample sizes.
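The QMDR row in the table above describes labeling cells by comparing the cell mean of a quantitative phenotype with the overall mean. A minimal Python sketch of that idea follows. The function name `qmdr_labels`, the genotype coding and the toy data are illustrative assumptions, not taken from any published implementation, and the sketch covers only the cell-labeling step.

```python
from collections import defaultdict
from statistics import mean


def qmdr_labels(genotypes, values, factors):
    """Label each observed cell by comparing its mean phenotype to the overall mean.

    genotypes: list of tuples with one genotype code per SNP (assumed coding 0/1/2)
    values:    list of quantitative phenotype values, one per sample
    factors:   tuple of SNP column indices defining the cells
    """
    overall = mean(values)
    cells = defaultdict(list)
    for row, value in zip(genotypes, values):
        cells[tuple(row[j] for j in factors)].append(value)
    # Assumption: a cell whose mean exceeds the overall mean is called high risk.
    return {cell: ("high" if mean(vals) > overall else "low")
            for cell, vals in cells.items()}


# Hypothetical toy data: five samples, two SNPs.
toy_genotypes = [(0, 0), (0, 0), (0, 1), (0, 1), (1, 1)]
toy_values = [1.0, 2.0, 5.0, 7.0, 3.0]
toy_labels = qmdr_labels(toy_genotypes, toy_values, factors=(0, 1))
```

Cells that are not observed in the data receive no label here; how to treat them is left open.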
Implementations of MDR-based methods
| Method | Ref | Implementation | URL | Consist/Sig | Cov |
|---|---|---|---|---|---|
| MDR | [ | Java | | k-fold CV | Yes |
| | [ | R | Available upon request, contact authors | k-fold CV, bootstrapping | No |
| | [ | Java | sourceforge.net/projects/mdr/files/mdrpt/ | k-fold CV, permutation | No |
| | [ | R | cran.r-project.org/web/packages/MDR/index.html | k-fold CV, 3WS, permutation | No |
| | [ | C++/CUDA | sourceforge.net/projects/mdr/files/mdrgpu/ | k-fold CV, permutation | No |
| | [ | C++ | ritchielab.psu.edu/software/mdr-download | k-fold CV, permutation | No |
| GMDR | [ | Java | | k-fold CV | Yes |
| PGMDR | [ | Java | | k-fold CV | Yes |
| SVM-GMDR | [ | MATLAB | Available upon request, contact authors | k-fold CV, permutation | Yes |
| RMDR | [ | Java | | k-fold CV, permutation | Yes |
| OR-MDR | [ | R | Available upon request, contact authors | k-fold CV, bootstrapping | No |
| Opt-MDR | [ | C++ | home.ustc.edu.cn/~zhanghan/ocp/ocp.html | GEVD | No |
| SDR | [ | Python | sourceforge.net/projects/sdrproject/ | k-fold CV, permutation | No |
| Surv-MDR | [ | R | Available upon request, contact authors | k-fold CV, permutation | Yes |
| QMDR | [ | Java | | k-fold CV, permutation | Yes |
| Ord-MDR | [ | C++ | Available upon request, contact authors | k-fold CV, permutation | No |
| MDR-PDT | [ | C++ | ritchielab.psu.edu/software/mdr-download | k-fold CV, permutation | No |
| MB-MDR | [ | C++ | | Permutation | No |
| | [ | R | cran.r-project.org/web/packages/mbmdr/index.html | Permutation | Yes |
| | [ | R | | Permutation | Yes |
Ref = Reference, Cov = Covariate adjustment possible, Consist/Sig = Strategies used to determine the consistency or significance of model.
Figure 3. Overview of the original MDR algorithm as described in [2] on the left with categories of extensions or modifications on the right. The first stage is data input, and extensions to the original MDR method dealing with other phenotypes or data structures are presented in the section ‘Different phenotypes or data structures’. The second stage comprises CV and permutation loops, and approaches addressing this stage are given in section ‘Permutation and cross-validation strategies’. The following stages encompass the core algorithm (see Figure 4 for details), which classifies the multifactor combinations into risk groups, and the evaluation of this classification (see Figure 5 for details). Methods, extensions and approaches mainly addressing these stages are described in sections ‘Classification of cells into risk groups’ and ‘Evaluation of the classification result’, respectively.
Figure 4. The MDR core algorithm as described in [2]. The following steps are executed for every number of factors (d). (1) From the exhaustive list of all possible d-factor combinations, select one. (2) Represent the selected factors in d-dimensional space and estimate the ratio of cases to controls in each cell in the training set. (3) A cell is labeled as high risk (H) if the ratio exceeds some threshold (T), or as low risk otherwise.
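The three steps of the core algorithm can be illustrated with a minimal Python sketch. This is not the reference implementation: the function names, the genotype coding (0/1/2), the default threshold of 1.0 and the toy data are assumptions made for illustration. Cells that contain cases but no controls are treated as exceeding any threshold, which is likewise an assumption.

```python
from collections import Counter
from itertools import combinations


def all_factor_combinations(n_snps, d):
    """Step 1: the exhaustive list of d-factor combinations (as column indices)."""
    return list(combinations(range(n_snps), d))


def classify_cells(genotypes, phenotypes, factors, threshold=1.0):
    """Steps 2 and 3: count cases and controls per cell, then label each cell.

    genotypes:  list of tuples with one genotype code per SNP (assumed coding 0/1/2)
    phenotypes: list of 0 (control) or 1 (case)
    factors:    tuple of SNP column indices defining the cells
    """
    cases, controls = Counter(), Counter()
    for row, status in zip(genotypes, phenotypes):
        cell = tuple(row[j] for j in factors)
        if status == 1:
            cases[cell] += 1
        else:
            controls[cell] += 1
    labels = {}
    for cell in set(cases) | set(controls):
        n_case, n_ctrl = cases[cell], controls[cell]
        # Assumption: a cell with cases but no controls has an infinite ratio.
        ratio = n_case / n_ctrl if n_ctrl > 0 else float("inf")
        labels[cell] = "high" if ratio > threshold else "low"
    return labels


# Hypothetical toy training set: eight samples, two SNPs.
toy_genotypes = [(0, 1), (0, 1), (0, 1), (1, 1), (1, 1), (1, 0), (1, 0), (2, 2)]
toy_phenotypes = [1, 1, 0, 0, 0, 1, 0, 1]
toy_labels = classify_cells(toy_genotypes, toy_phenotypes, factors=(0, 1))
```

In this toy example the cell `(1, 0)` holds one case and one control, so its ratio equals the threshold and it is labeled low risk, because the rule requires the ratio to exceed the threshold.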
Figure 5. Evaluation of cell classification as described in [2]. The accuracy of every d-model, i.e. d-factor combination, is assessed in terms of classification error (CE), cross-validation consistency (CVC) and prediction error (PE). Among all d-models, the single model with the lowest average CE is selected, yielding a set of best models, one for each d. Among these best models, the one minimizing the average PE is selected as the final model. To determine statistical significance, the observed CVC is compared to the empirical distribution of CVC under the null hypothesis of no interaction, derived by random permutations of the phenotypes.
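The model selection described in this caption can be sketched in Python as follows. The sketch makes several assumptions that are not taken from the source:

- Folds are built by interleaving sample indices.
- Error is the plain misclassification rate.
- Cells unseen in the training data are predicted as low risk.
- Cross-validation consistency is counted as the number of folds in which the selected model attains the lowest training error.

All function names and the toy data are hypothetical. The permutation step for significance is omitted.

```python
from collections import Counter
from itertools import combinations


def fit_high_risk(X, y, factors, threshold=1.0):
    """Return the set of cells whose case/control ratio exceeds the threshold."""
    cases, controls = Counter(), Counter()
    for row, label in zip(X, y):
        cell = tuple(row[j] for j in factors)
        (cases if label == 1 else controls)[cell] += 1
    # cases > threshold * controls is the ratio rule without division by zero.
    return {c for c in cases if cases[c] > threshold * controls[c]}


def error_rate(X, y, factors, high_risk):
    """Misclassification rate; unseen cells count as low risk (assumption)."""
    wrong = sum((tuple(row[j] for j in factors) in high_risk) != (label == 1)
                for row, label in zip(X, y))
    return wrong / len(y)


def evaluate_d(X, y, d, k=2, threshold=1.0):
    """Pick the d-model with the lowest average training error over k folds."""
    n = len(y)
    folds = [set(range(i, n, k)) for i in range(k)]  # interleaved folds (assumption)
    models = list(combinations(range(len(X[0])), d))
    ce = {m: [] for m in models}  # classification error on training data
    pe = {m: [] for m in models}  # prediction error on held-out data
    for test in folds:
        train = [i for i in range(n) if i not in test]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        X_te, y_te = [X[i] for i in sorted(test)], [y[i] for i in sorted(test)]
        for m in models:
            high = fit_high_risk(X_tr, y_tr, m, threshold)
            ce[m].append(error_rate(X_tr, y_tr, m, high))
            pe[m].append(error_rate(X_te, y_te, m, high))
    best = min(models, key=lambda m: sum(ce[m]) / k)
    cvc = sum(ce[best][f] == min(ce[m][f] for m in models) for f in range(k))
    return {"model": best, "d": d, "CE": sum(ce[best]) / k,
            "PE": sum(pe[best]) / k, "CVC": cvc}


def final_model(X, y, d_values, k=2, threshold=1.0):
    """Among the best model for each d, return the one with the lowest average PE."""
    return min((evaluate_d(X, y, d, k, threshold) for d in d_values),
               key=lambda r: r["PE"])


# Hypothetical toy data: the phenotype is the XOR of SNP 0 and SNP 1; SNP 2 is noise.
X = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)] * 2
y = [a ^ b for a, b, _ in X]
result = final_model(X, y, d_values=(1, 2))
```

On this toy data no single SNP separates cases from controls, so only the two-factor model on SNPs 0 and 1 reaches zero error. A permutation test could be added by shuffling `y`, rerunning `final_model` and comparing the resulting statistics with the observed one.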