Simon Emanuel Harnqvist, Cooper Alastair Grace, Daniel Charlton Jeffares.
Abstract
Which variables determine the constraints on gene sequence evolution is one of the most central questions in molecular evolution. In the fission yeast Schizosaccharomyces pombe, an important model organism, the variables influencing the rate of sequence evolution have yet to be determined. Previous studies in other single-celled organisms have generally found gene expression levels to be the most significant variable, with numerous others, such as gene length and functional importance, identified as having a smaller impact. Using publicly available data, we applied partial least squares regression, principal components regression, and partial correlations to determine the variables most strongly associated with sequence evolution constraints. We identify centrality in the protein-protein interaction network, amino acid composition, and cellular location as the most important determinants of sequence conservation. However, each factor explains only a small amount of variance, and there are numerous variables having a significant or heterogeneous influence. Our models explain more than half of the variance in dN, raising the possibility that future refined models could quantify the role of stochastics in evolutionary rate variation.
Year: 2021 PMID: 34436628 PMCID: PMC8599406 DOI: 10.1007/s00239-021-10028-y
Source DB: PubMed Journal: J Mol Evol ISSN: 0022-2844 Impact factor: 2.395
Fig. 1 Maximum likelihood phylogenetic tree of the four fission yeast (Schizosaccharomyces) species based on a concatenated alignment of 50 orthologous groups, chosen at random. Created with MEGA 10.2.5 using the Tamura-Nei model assuming uniform rates across all sites, using Nearest-Neighbor-Interchange as ML heuristic and rooted with the S. japonicus sequences. Visualised with FigTree 1.4.4 (tree.bio.ed.ac.uk/software/figtree/)
Fig. 2 (A) Percent variance in dN explained by each variable using principal components regression compared to VIP scores for each variable in a partial least squares regression. Variables that are important in both models are closer to the top right corner. (B) VIP scores per variable in a PLS model with dN as dependent variable, grouped by variable group. (C) Percent variance explained per variable in a PCR model with dN as dependent variable, grouped by variable group. (D) Partial correlations (Spearman) between each variable and dN. Only significant correlations (Bonferroni-adjusted p < 0.05) shown
Fig. 3 (A) Percent variance in phyloP explained by each variable using principal components regression compared to VIP scores for each variable in a partial least squares regression. Variables that are important in both models are closer to the top right corner. (B) VIP scores per variable in a PLS model with phyloP as dependent variable, grouped by variable group. (C) Percent variance explained per variable in a PCR model with phyloP as dependent variable, grouped by variable group. (D) Partial correlations (Spearman) between each variable and phyloP. Only significant correlations (Bonferroni-adjusted p < 0.05) shown
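The partial Spearman correlations in panel D of Figs. 2 and 3 can be illustrated with a minimal sketch. It assumes a single control variable and the standard first-order partial correlation formula applied to ranks; the source does not state how many covariates were controlled for or which software was used, so the function names (`ranks`, `pearson`, `partial_spearman`, `bonferroni`) and the toy numbers are purely illustrative. The p-values themselves are not computed here; `bonferroni` only shows the adjustment applied to p-values obtained elsewhere.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def partial_spearman(x, y, z):
    """Spearman correlation of x and y after controlling for z (one covariate)."""
    rx, ry, rz = ranks(x), ranks(y), ranks(z)
    r_xy, r_xz, r_yz = pearson(rx, ry), pearson(rx, rz), pearson(ry, rz)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Toy data (hypothetical, not from the study).
predictor = [1, 2, 3, 4, 5]
rate = [1, 3, 2, 5, 4]
covariate = [2, 1, 4, 3, 5]
rho = partial_spearman(predictor, rate, covariate)
adjusted = bonferroni([0.01, 0.04, 0.5])
```

With many predictors, controlling for all remaining variables at once would require regressing the ranks on the full covariate set rather than using the one-covariate formula shown here.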
Comparison of the variance in the dependent variable explained on the holdout test set by three regression techniques applied to sequence evolution constraint prediction
| Dependent variable | PLS (%) | PCR (%) | RF (%) |
|---|---|---|---|
| phyloP | 32.4 | 31.0 | 42.4 |
| dN | 52.6 | 51.3 | 58.7 |
The RF model was trained to provide a comparison of model performance
PLS partial least squares, PCR principal components regression, RF random forest
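As a rough illustration of what "percent variance explained on a holdout test set" means, the sketch below fits a model on training data only and scores it on unseen data. It assumes the conventional coefficient of determination, 1 - SS_res / SS_tot, which the source does not spell out, and it uses a single-predictor least-squares line as a stand-in for the PLS, PCR, and RF models. The function names and the toy numbers are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (
        sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        / sum((a - mean_x) ** 2 for a in x)
    )
    return slope, mean_y - slope * mean_x


def variance_explained(y_true, y_pred):
    """Percent variance explained: 100 * (1 - SS_res / SS_tot)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 100.0 * (1 - ss_res / ss_tot)


# Toy data (hypothetical): the model sees only the training split.
train_x, train_y = [1, 2, 3, 4], [2, 4, 6, 8]
test_x, test_y = [5, 6, 7], [10, 13, 14]

slope, intercept = fit_line(train_x, train_y)
predictions = [slope * x + intercept for x in test_x]
holdout_pct = variance_explained(test_y, predictions)
```

Scoring on held-out genes rather than on the training set guards against overfitting inflating the reported percentages, which matters most for flexible models such as random forests.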