Lahiru Iddamalgoda, Partha S Das, Achala Aponso, Vijayaraghava S Sundararajan, Prashanth Suravajhala, Jayaraman K Valadi.
Abstract
Data mining and pattern recognition methods reveal interesting findings in genetic studies, especially on how genetic makeup is associated with inherited diseases. Although researchers have proposed various data mining models for biomedical applications, accurately prioritizing the single nucleotide polymorphisms (SNPs) associated with a disease remains a challenge. In this commentary, we review state-of-the-art data mining and pattern recognition models for identifying inherited diseases and discuss the need for binary classification- and scoring-based prioritization methods in determining causal variants. While we discuss the known pros and cons of these methods, we argue that gene prioritization methods and protein-protein interaction (PPI) methods, in conjunction with K-nearest neighbors, could be used to accurately categorize the genetic factors in disease causation.
Keywords: data mining; inherited diseases; machine learning; protein-protein interaction; single nucleotide polymorphism
Year: 2016 PMID: 27559342 PMCID: PMC4979376 DOI: 10.3389/fgene.2016.00136
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1: Flow chart of the AdaBoost algorithm.
Comparison of the ensemble learning approaches.

| Method | Description | Advantages |
| --- | --- | --- |
| AdaBoost | AdaBoost (adaptive boosting) is one of the most popular boosting algorithms. It adaptively re-weights the training samples during the boosting process, according to the weighted classification error of the previous round of training (Vapnik, | Takes advantage of the decision tree as the basic "weak" classifier (Jiaxin et al., |
| LogitBoost | LogitBoost is an improved version of the AdaBoost algorithm. The main difference is that LogitBoost uses the binomial log-likelihood as its loss function, which is better suited to binary classification than the exponential loss underlying AdaBoost. | Compared with AdaBoost, LogitBoost is found to be more effective on noisy data, is easier to implement, and does not require tuning or model/kernel selection as neural networks or support vector machines do. LogitBoost can work with logit models, decision stumps, or decision trees (Friedman et al., |
| Random Forest trees | Random forest is a bagging-based ensemble method that grows many classification trees and predicts by majority vote among them. On large data sets, the method shows outstanding efficiency. | This method has several advantages: few parameters to adjust, resistance to over-fitting, fast computation, and strong robustness to noise. In addition, random forests have a built-in method to estimate the importance of features, which is useful for prioritizing features by importance and reducing the feature set to lower computational complexity (Jiaxin et al., |
| L2boosting | L2boosting is a gradient boosting algorithm for optimizing arbitrary loss functions in which component-wise linear models are used as base learners. It has shown better performance than decision stumps (trees with two terminal nodes) and other common competitors, particularly when the predictor space is high-dimensional. | In addition, L2boosting works well for both regression and classification problems, and for classification problems its performance is comparable to LogitBoost (Jiaxin et al., |
| Stochastic gradient regression | Stochastic gradient regression is a prediction method that uses regression trees as base learners. To optimize via gradient descent, it fits each successive regression tree to the pseudo-residuals given by the negative gradient of the loss function. | The algorithm randomly selects a subset of the pseudo-residuals, rather than all of them, to build each regression tree. The final model is a linear combination of the regression trees (Jiaxin et al., |
| Support Vector Machine | The Support Vector Machine (SVM), also known as a "support vector network," is a machine learning method for binary classification problems, although multi-class SVM implementations exist. Input vectors are mapped to a high-dimensional feature space, where a linear decision surface is constructed with special properties that ensure high generalization ability. | The idea behind the support vector network has been extensively applied in biology; the method was first derived for the restricted case where the training data can be separated without errors and was later extended to non-separable training data (Cortes and Vapnik, |
| Decision trees | This method applies to scenarios in which specific decision alternatives cannot be predicted with a high level of confidence. It is a hierarchical model for supervised learning in which local regions are identified by a sequence of recursive splits in a few steps. A tree is composed of decision nodes and terminal leaves, and trees can be of various types, such as univariate trees, classification trees, and regression trees. When making a decision, many different factors are taken as inputs, and the decision tree uses its own feature-selection strategy to keep only those useful for classification (Breiman et al., | Decision trees solve complex decision problems with significant uncertainty (Safavian and Landgrebe, |
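The methods compared in the table above can be sketched side by side with scikit-learn. This is a minimal illustration on synthetic data, not the paper's experiment: the dataset, hyperparameters, and any resulting scores are invented for the example (gradient boosting with `subsample < 1.0` stands in for stochastic gradient regression; L2boosting has no direct scikit-learn equivalent and is omitted).

```python
# Hypothetical side-by-side run of the classifiers from the comparison table
# on a synthetic binary-classification task (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a SNP feature matrix: 500 samples, 20 features.
X, y = make_classification(n_samples=500, n_features=20, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
    # subsample < 1.0 makes gradient boosting stochastic
    "Stochastic gradient boosting": GradientBoostingClassifier(subsample=0.5,
                                                               random_state=0),
    "SVM": SVC(kernel="linear"),
    "Decision tree": DecisionTreeClassifier(random_state=0),
}

for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    print(f"{name}: test accuracy {clf.score(X_te, y_te):.2f}")
```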
Figure 2: Performance of the approaches as explained in the methods. (A) Comparison of the performance of each classification method: the prediction accuracy (ACC), the area under the receiver operating characteristic (ROC) curve (AUC), the balanced error rate (BER), and the Matthews correlation coefficient (MCC). (B) Performance of the 7 approaches in terms of true positive rate vs. false positive rate.
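The four metrics used in Figure 2 can all be computed with scikit-learn. The labels and scores below are made-up toy values, chosen only to show the calls; BER is obtained as one minus the balanced accuracy.

```python
# Toy computation of the Figure 2 metrics: ACC, AUC, BER, MCC.
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             matthews_corrcoef, roc_auc_score)

y_true  = [1, 1, 1, 0, 0, 0, 1, 0]            # invented ground truth
y_pred  = [1, 1, 0, 0, 0, 1, 1, 0]            # invented hard predictions
y_score = [0.9, 0.8, 0.4, 0.2, 0.1, 0.7, 0.6, 0.3]  # invented probabilities

acc = accuracy_score(y_true, y_pred)                 # ACC
auc = roc_auc_score(y_true, y_score)                 # AUC of the ROC curve
ber = 1.0 - balanced_accuracy_score(y_true, y_pred)  # BER = 1 - balanced acc.
mcc = matthews_corrcoef(y_true, y_pred)              # MCC

print(f"ACC={acc:.3f} AUC={auc:.3f} BER={ber:.3f} MCC={mcc:.3f}")
# → ACC=0.750 AUC=0.875 BER=0.250 MCC=0.500
```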
Figure 3: The SNP Pathway based Association Method (SPAM). The first step is SNP significance analysis and the corresponding risk calculation: case and control data are gathered for each SNP, and the risk ratio is calculated from the relationships between the SNP and complex diseases. The second step is reconstruction of the KEGG pathway and analysis of the reconstructed network's attributes; in the KEGG format, a pathway is a network in which a node represents a metabolite and an edge represents an enzyme or a gene cluster. The third step involves screening of SNPs and mapping them to the reconstructed network (Hoh and Ott, 2003). The fourth step is calculation of the two integrated RS-scoring measurements and prioritization of the pathways.
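The per-SNP risk calculation in the first SPAM step can be illustrated with a simple risk-ratio (relative risk) computation. The function and the counts below are invented for illustration; the paper does not give this exact formulation.

```python
# Toy sketch of SPAM step 1: risk ratio of disease among risk-allele
# carriers vs. non-carriers, from case/control counts (invented numbers).
def risk_ratio(carrier_cases, carrier_total, noncarrier_cases, noncarrier_total):
    """Relative risk: disease incidence in carriers over non-carriers."""
    risk_carriers = carrier_cases / carrier_total
    risk_noncarriers = noncarrier_cases / noncarrier_total
    return risk_carriers / risk_noncarriers

# e.g. 30 of 100 risk-allele carriers affected vs. 10 of 100 non-carriers
rr = risk_ratio(30, 100, 10, 100)
print(rr)  # → 3.0 (carriers are three times as likely to be affected)
```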
Figure 4: Gene prioritization methods. (A) (a) Entities and their relations are retrieved from the source databases and then filtered, sorted, and disambiguated to create the graph; a list of genes for prioritization is obtained from gene expression studies, positional studies, or GO (Arrais and Oliveira, 2010). (b) Gene semantic similarity scores are calculated to obtain three gene similarity profiles. (B) The three gene similarity profiles and the phenotype similarity profile are combined to calculate three-dimensional features for classification. (C) A classifier is trained using known associations and a sufficient number of unrelated gene-disease pairs as training sets. (D) Disease-gene associations are predicted. (E) Candidate genes are prioritized (He and Jiang, 2012).
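Steps (B)-(E) of this pipeline can be sketched as follows: each gene-disease pair becomes a three-dimensional similarity feature vector, a classifier is trained on known associations versus unrelated pairs, and candidates are ranked by predicted probability. K-nearest neighbors is used here because the commentary argues for it; all similarity scores below are synthetic, not real profiles.

```python
# Sketch of classifier-based gene prioritization on synthetic 3-D
# similarity features (steps B-E of Figure 4), using K-nearest neighbors.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Synthetic training data: known gene-disease associations score high on
# the three similarity profiles, unrelated pairs score low.
pos = rng.uniform(0.6, 1.0, size=(50, 3))
neg = rng.uniform(0.0, 0.4, size=(50, 3))
X = np.vstack([pos, neg])
y = np.array([1] * 50 + [0] * 50)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Hypothetical candidate gene-disease pairs to prioritize.
candidates = np.array([[0.9, 0.8, 0.7],
                       [0.1, 0.2, 0.3],
                       [0.5, 0.6, 0.4]])
scores = knn.predict_proba(candidates)[:, 1]  # P(associated)
ranking = np.argsort(-scores)                 # prioritized candidate order
print(scores, ranking)
```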