Omar Navarro Leija, Sanju Varghese, Mira V Han.
Abstract
Evolutionary constraint for insertions and deletions (indels) is not necessarily equal to constraint for nucleotide substitutions for any given region of a genome. Knowing the variation in indel-specific evolutionary rates across the sequence will aid our understanding of evolutionary constraints on indels, and help us infer how indels have contributed to the evolution of the sequence. However, unlike for nucleotide substitutions, there has been no phylogenetic method that can statistically infer significantly different rates of indels across the sequence space independent of substitution rates. Here, we have developed software that finds sites with accelerated evolutionary rates specific to indels, by introducing a scaling parameter that applies only to the indel rates and not to the nucleotide substitution rates. Using the software, we show that we can find regions of accelerated rates of indels in the protein alignments of primate genomes. We also confirm that the sites that have high rates of indels are different from the sites that have high rates of nucleotide substitutions within the protein sequences. By identifying regions with accelerated rates of indels independent of nucleotide substitutions, we will be able to better understand the impact of indel mutations on protein sequence evolution.
Keywords: Deletions; Evolutionary constraint; Insertions; Substitution rate
Year: 2016 PMID: 27770175 PMCID: PMC5080320 DOI: 10.1007/s00239-016-9761-9
Source DB: PubMed Journal: J Mol Evol ISSN: 0022-2844 Impact factor: 2.395
Models newly implemented in the software
Three models are newly implemented in the extended version of PHAST: F84 by Felsenstein and Churchill; F84ε-relaxed, a modified version of the F84ε model (Rivas and Eddy 2008); and F84ε-relaxed + ρ indel, in which we introduce a scaling parameter that modifies the indel rates.
Model comparisons newly implemented in the software
| Comparison | | Description | Score |
|---|---|---|---|
| F84 vs. F84 + ρ | 1 | Does scaling on the substitution rates (= scaling branch lengths) improve the model fit? | phyloP |
| F84ε-relaxed vs. F84ε-relaxed + ρ indel | 1 | Does scaling on the indel rates improve the model fit? | indelP |
Two model comparisons are implemented as likelihood ratio tests. The first compares F84 to F84 + ρ using the native branch scaling of the original phyloP program. The second compares F84ε-relaxed to F84ε-relaxed + ρ indel using our newly implemented indel rate scaling. The p-value from the second likelihood ratio test is transformed into an indelP score.
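As a rough illustration of how a likelihood ratio test between two nested models yields a p-value, the sketch below assumes the alternative model adds exactly one free parameter (the scaling parameter) and that the test statistic is referred to a chi-square distribution with one degree of freedom. The function name and the log-likelihood values are hypothetical and are not taken from the software.

```python
import math


def lrt_p_value(lnl_null: float, lnl_alt: float) -> float:
    """P-value of a likelihood ratio test with one extra free parameter.

    Assumes the statistic follows a chi-square distribution with 1 d.f.
    """
    # Test statistic: twice the gain in log-likelihood.
    # Clamp at zero in case numerical noise makes the gain slightly negative.
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # For chi-square with 1 d.f., P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical per-site log-likelihoods (illustrative values only).
p = lrt_p_value(lnl_null=-120.0, lnl_alt=-118.0)
print(f"p-value = {p:.4f}")
```

Using the closed form for one degree of freedom keeps the sketch free of external dependencies; a statistics library would be the more usual choice for other degrees of freedom.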
Number of sites with significantly different indel rates and nucleotide substitution rates
| Type of event | Significance level | Significant sites | Total sites |
|---|---|---|---|
| Indel | 0.05 | 19,134 | 942,411 |
| Indel | 5.3e−8 | 102 | 942,411 |
| Nucleotide substitution | 0.05 | 47,243 | 942,411 |
| Nucleotide substitution | 5.3e−8 | 177 | 942,411 |
Number of sites with a significant likelihood ratio test with and without correction for multiple testing
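The corrected threshold in the table is consistent with a Bonferroni correction, that is, dividing the nominal level of 0.05 by the 942,411 sites tested. A minimal sketch of that arithmetic (the constant and function names are illustrative):

```python
ALPHA = 0.05           # nominal significance level
TOTAL_SITES = 942_411  # number of sites tested, from the table

# Bonferroni correction: divide the nominal level by the number of tests.
bonferroni_alpha = ALPHA / TOTAL_SITES


def is_significant(p_value: float, corrected: bool = True) -> bool:
    """Compare a per-site p-value with the corrected or nominal threshold."""
    threshold = bonferroni_alpha if corrected else ALPHA
    return p_value < threshold


print(f"{bonferroni_alpha:.1e}")  # prints 5.3e-08
```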
Fig. 1 Volcano plot of the likelihood ratio test and the estimated scale parameter for each site in the alignments. Plot of significance versus scaling resulting from the model comparison on 942,411 sites. (a) F84 versus F84 + ρ tests for significantly different rates of nucleotide substitutions. (b) F84ε-relaxed versus F84ε-relaxed + ρ indel tests for significantly different rates of indels. Positive values on the X-axis represent sites with accelerated rates (scaling > 1), while negative values on the X-axis represent sites with conserved rates (scaling < 1). The Y-axis represents the p-value from the likelihood ratio test of the model comparison.
Fig. 2 phyloP score and indelP score for the podoplanin gene family. Alignment of an example gene family, podoplanin (PDPN), with phyloP scores and indelP scores for each site. Scores are calculated as log(p-value) multiplied by +1 (conservation, scaling < 1) or −1 (acceleration, scaling > 1). The reference line marks a score of zero (p-value = 1), and the bars are scaled by normalizing the scores for each family to the range [−50, 50].
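The signed score described in the caption can be sketched as follows. The caption does not state the base of the logarithm, nor whether the sign is applied to the log itself or to its magnitude. This sketch assumes base 10 and the phyloP-style convention in which the magnitude −log10(p) is positive for conservation and negative for acceleration. The function name is hypothetical, and the per-family normalization to [−50, 50] is not reproduced here.

```python
import math


def signed_score(p_value: float, scale: float) -> float:
    """Signed log p-value score (assumed convention, see lead-in).

    scale < 1 (conservation) gives a positive score,
    scale > 1 (acceleration) gives a negative score.
    """
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must lie in (0, 1]")
    magnitude = -math.log10(p_value)  # zero when p_value == 1
    return magnitude if scale < 1.0 else -magnitude


# Illustrative values only: a conserved site and an accelerated site.
print(signed_score(0.001, scale=0.5))
print(signed_score(0.001, scale=2.0))
```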
Overlap between sites with significantly different indel rates and sites with significantly different nucleotide substitution rates
| Significance level | Significant for indels and nucleotide substitutions | Significant for indels, but not significant for nucleotide substitutions | Not significant for indels, but significant for nucleotide substitutions | Not significant for indels nor for nucleotide substitutions |
|---|---|---|---|---|
| 0.05 | 1590 | 17,544 | 45,653 | 877,624 |
| 5.3e−8 | 0 | 102 | 177 | 942,132 |
Number of sites with significantly different rates for indels and nucleotide substitutions at significance level of 0.05 and 5.3e−8 (Bonferroni-corrected)
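The four cells of each row partition the full set of sites, so the per-event counts can be recovered by summing cells. The sketch below does this for the Bonferroni-corrected row; the function and variable names are illustrative.

```python
def marginals(both: int, indel_only: int, subst_only: int, neither: int):
    """Return (indel-significant, substitution-significant, total) site counts."""
    indel_total = both + indel_only
    subst_total = both + subst_only
    all_sites = both + indel_only + subst_only + neither
    return indel_total, subst_total, all_sites


# Cells of the Bonferroni-corrected row (significance level 5.3e-8).
indel_total, subst_total, all_sites = marginals(0, 102, 177, 942_132)
print(indel_total, subst_total, all_sites)  # prints 102 177 942411
```

The recovered totals match the 102 indel sites, 177 substitution sites, and 942,411 total sites reported for the same threshold in the earlier table.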
Fig. 3 Relationship between phyloP score and indelP score. Plot of phyloP versus indelP scores for 942,411 sites
Comparison of results based on three different alignment data
| Filtering | | | | | | |
|---|---|---|---|---|---|---|
| 15 gaps | 0.0066 | 0.0371 | 0.5449 | 0.4551 | 0.9980 | 0.25, 0.26, 0.26, 0.22 |
| 30 gaps | 0.0085 | 0.0617 | 0.5135 | 0.4865 | 0.9981 | 0.25, 0.26, 0.26, 0.22 |
| 45 gaps | 0.0090 | 0.0947 | 0.5018 | 0.4982 | 0.9981 | 0.25, 0.26, 0.26, 0.22 |
Estimated parameters and results of the indel model comparison for alignment data filtered by different numbers of gaps. The estimated rates of insertions (λ) and deletions (μ) are influenced by the number of gaps in the dataset. More gaps in the data lead to higher rates of insertions and deletions, and to a larger total length of insertions and deletions.
Fig. 4 Example of misalignment leading to significant indelP scores. Misalignment in the sequence data can look like accelerated indel rates