Sudhir Kumar, Sudip Sharma.
Abstract
We introduce a supervised machine learning approach with sparsity constraints for phylogenomics, referred to as evolutionary sparse learning (ESL). ESL builds models with genomic loci (such as genes, proteins, genomic segments, and positions) as parameters. Using the Least Absolute Shrinkage and Selection Operator, ESL selects only the most important genomic loci to explain a given phylogenetic hypothesis or presence/absence of a trait. ESL models do not directly involve conventional parameters such as rates of substitutions between nucleotides, rate variation among positions, and phylogeny branch lengths. Instead, ESL directly employs the concordance of variation across sequences in an alignment with the evolutionary hypothesis of interest. ESL provides a natural way to combine different molecular and nonmolecular data types and incorporate biological and functional annotations of genomic loci in model building. We propose positional, gene, function, and hypothesis sparsity scores, illustrate their use through an example, and suggest several applications of ESL. The ESL framework has the potential to drive the development of a new class of computational methods that will complement traditional approaches in evolutionary genomics, particularly for identifying influential loci and sequences given a phylogeny and building models to test hypotheses. ESL's fast computational times and small memory footprint will also help democratize big data analytics and improve scientific rigor in phylogenomics.
Keywords: functional genomics; machine learning; phylogenetics; phylogenomics; total evidence
Year: 2021 PMID: 34343318 PMCID: PMC8557465 DOI: 10.1093/molbev/msab227
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Fig. 1. A schematic representation of models in evolutionary sparse learning (ESL). (a) The sequence alignment contains a number of positions, and the regression model (b) can contain as many variables (that is, features in machine learning) as there are positions. Each regression coefficient measures the degree of association between the base configuration at one position and the function of the outcome. The outcome is assigned to each sequence based on the phylogenetic relationship or the presence/absence of a trait. (c) One-hot encoding of the sequence alignment, in which a column of bits represents each base. ESL estimates a regression coefficient for every bit-column at every position. In the response vector, all sequences belonging to the target clade (black) are represented by +1, and those in the other clade (blue) by −1. (d) Positions can be clustered into groups (e.g., genes) for bilevel sparsity.
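The encoding described in panel (c) can be illustrated with a minimal sketch. The toy alignment, the species names, the A/C/G/T column order, and the all-zero treatment of gaps are assumptions made for illustration and are not taken from the paper.

```python
# Minimal sketch of one-hot encoding an alignment and building a +1/-1 response.
# The alignment, species names, and gap handling below are hypothetical.
BASES = "ACGT"  # assumed bit-column order within each position


def one_hot_encode(alignment):
    """Return one row of bits per sequence: four bit-columns per position."""
    rows = []
    for seq in alignment:
        bits = []
        for base in seq:
            # Characters outside A/C/G/T (e.g. gaps) yield four zeros (assumption).
            bits.extend(1 if base == b else 0 for b in BASES)
        rows.append(bits)
    return rows


def response_vector(names, target_clade):
    """Label sequences in the target clade +1 and all others -1."""
    return [1 if name in target_clade else -1 for name in names]


names = ["sp1", "sp2", "sp3", "sp4"]
alignment = ["ACGT", "ACGA", "TCGT", "T-GA"]

X = one_hot_encode(alignment)                            # 4 sequences x 16 bit-columns
y = response_vector(names, target_clade={"sp1", "sp2"})  # [1, 1, -1, -1]
```

Each row of X then serves as the feature vector for one sequence, and y is the outcome that a sparse regression would be fitted against.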
Fig. 2. ESL analysis of a multiple sequence alignment of plant species. (a) The plant phylogeny with the sequences in one focused clade marked as +1 (black) and the rest marked as −1 (blue), corresponding to the sequence assignment (branch #1). Two other branches are also highlighted (light green and pink). (b) Genes included in the ESL model after using sparse group lasso and ridge regression for branch #1. (c) The distributions of scores of 80 genes included in the ESL model for branch #1. (d) A violin plot showing the distribution of for all the sequences using the ESL model for branch #1. (e) ROC curve showing the tradeoff between the true positive rate and the false positive rate of classification of the ESL model for the phylogenetic partition induced by branch #1 (black). ROC curves of two other branches are also shown (light green and pink lines). ESL analyses were performed independently with similar settings for all three branches. ROC curves for different branches were calculated using the genes selected in each ESL model. The areas under the ROC curves (AUC) are also presented. (f) Scatter plots showing the relationship between the proportion of bootstrap ESL models in which a gene appeared and its GSS (brown) and the coefficient of variation (CV) of . ESL models were generated with the SLEP software (Liu et al. 2011) in MATLAB by analyzing a multiple sequence alignment of 620 genes (290,718 sites) from 103 plant species. The “sgLogisticR” function for bilevel logistic sparse group lasso regression was applied with a starting feature regularization parameter of 0.1 and a starting group regularization parameter of 0.2. The square root of gene length was used as group weight. SLEP uses the Moreau–Yosida Regularization algorithm, and we specified 100 iterations to obtain an optimal parameter. In these iterations, the feature and group regularization parameters were automatically optimized (to 0.0226 and 0.0129, respectively).
We conducted ridge regression analysis with regularization, using the “gLogisticR” function, for the genes selected in (b). The starting regularization parameter was used, and the square root of gene length was used as group weight.
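To make the role of the two regularization parameters and the group weight concrete, the following sketch evaluates a sparse group lasso penalty in its commonly used form: a feature-level L1 term plus a weighted sum of per-gene L2 norms. It is not SLEP's implementation; the function name, the gene names, the coefficient values, and the gene lengths are hypothetical, and only the starting values 0.1 and 0.2 and the square-root-of-gene-length weight come from the caption.

```python
import math


def sparse_group_lasso_penalty(coefs_by_gene, gene_lengths, lam_feature, lam_group):
    """Penalty = lam_feature * sum|b| + lam_group * sum_g sqrt(len_g) * ||b_g||_2."""
    l1_term = 0.0
    group_term = 0.0
    for gene, coefs in coefs_by_gene.items():
        # Feature-level sparsity: absolute value of every coefficient.
        l1_term += sum(abs(b) for b in coefs)
        # Group-level sparsity: L2 norm of the gene's coefficients,
        # weighted by the square root of the gene length.
        weight = math.sqrt(gene_lengths[gene])
        group_term += weight * math.sqrt(sum(b * b for b in coefs))
    return lam_feature * l1_term + lam_group * group_term


# Hypothetical coefficients: geneA carries signal, geneB has been zeroed out.
coefs_by_gene = {"geneA": [3.0, -4.0, 0.0, 0.0], "geneB": [0.0] * 9}
gene_lengths = {"geneA": 4, "geneB": 9}

penalty = sparse_group_lasso_penalty(coefs_by_gene, gene_lengths, 0.1, 0.2)

# Genes whose coefficients are not all zero remain in the model.
selected = [g for g, coefs in coefs_by_gene.items() if any(b != 0.0 for b in coefs)]
```

Because the group term charges each gene in proportion to the norm of its whole coefficient block, a gene contributes nothing to the penalty only when all of its coefficients are zero, which is what drives gene-level selection in addition to position-level selection.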