Gil Loewenthal1, Dana Rapoport1, Oren Avram1, Asher Moshe1, Elya Wygoda1, Alon Itzkovitch1, Omer Israeli1, Dana Azouri1,2, Reed A Cartwright3,4, Itay Mayrose2, Tal Pupko1.
Abstract
Insertions and deletions (indels) are common molecular evolutionary events. However, probabilistic models for indel evolution are under-developed due to their computational complexity. Here, we introduce several improvements to indel modeling: (1) while previous models for indel evolution assumed that the rates and length distributions of insertions and deletions are equal, here we propose a richer model that explicitly distinguishes between the two; (2) we introduce numerous summary statistics that allow approximate Bayesian computation-based parameter estimation; (3) we develop a method to correct for biases introduced by alignment programs when inferring indel parameters from empirical data sets; and (4) using a model-selection scheme, we test whether the richer model better fits biological data compared with the simpler model. Our analyses suggest that both our inference scheme and the model-selection procedure achieve high accuracy on simulated data. We further demonstrate that our proposed richer model better fits a large number of empirical data sets and that, for the majority of these data sets, the deletion rate is higher than the insertion rate.
Keywords: alignments; approximate Bayesian computation; evolutionary models; indels; molecular evolution
Year: 2021 PMID: 34469521 PMCID: PMC8662616 DOI: 10.1093/molbev/msab266
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
The Simple Indel Model (SIM) and Rich Indel Model (RIM) Parameters and Their Description.
| Model | Parameter Name | Description |
|---|---|---|
| SIM, RIM | | The sequence length at the root of the tree |
| SIM | | Indel-to-substitution-rate ratio |
| SIM | | Indel length distribution parameter |
| RIM | | Insertion-to-substitution-rate ratio |
| RIM | | Deletion-to-substitution-rate ratio |
| RIM | | Insertion length distribution parameter |
| RIM | | Deletion length distribution parameter |
Fig. 1. Model accuracy in simulations: inferred parameters are correlated with the parameters used for simulation. Each point plots a single inferred value of the corresponding parameter against its true value. Each graph is based on 200 independent simulations. The dashed line is the identity line. (a) Performance when the input alignment is the true alignment, that is, in the ideal setting in which the input alignment is error free. The values obtained for the five parameters are 0.93, 0.93, 0.86, 0.69, and 0.94. (b) Performance without correcting for biases introduced into the empirical alignment (i.e., the summary statistics of the analyzed alignment were derived from MAFFT alignments, whereas the summary statistics of the simulated alignments within the ABC inference were computed from alignments not inferred using MAFFT). In each case, the true alignment was unaligned and realigned using MAFFT. The values obtained for the five parameters are 0.55, 0.64, 0.67, 0.49, and 0.75. (c) Performance when the summary statistics of the simulated alignments within the ABC inference scheme were corrected for biases introduced by MAFFT. The values obtained for the five parameters are 0.87, 0.92, 0.81, 0.71, and 0.93.
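The ABC inference evaluated in this figure can be illustrated with a minimal rejection-sampling sketch. Everything in it is an assumption made for illustration: the toy simulator (independent per-site insertion and deletion events), the uniform priors on [0, 0.1], the two-number summary, and the names `simulate_summary` and `abc_rejection` are hypothetical and are not the authors' simulator, priors, or summary statistics. It also omits the alignment-bias correction described in panel (c).

```python
import random

def simulate_summary(r_ins, r_del, n_sites, rng):
    """Toy simulator (assumption): each site independently gains an insertion
    with probability r_ins and a deletion with probability r_del.
    Returns the summary (number of insertions, number of deletions)."""
    n_ins = sum(rng.random() < r_ins for _ in range(n_sites))
    n_del = sum(rng.random() < r_del for _ in range(n_sites))
    return (n_ins, n_del)

def abc_rejection(observed, n_sites, n_draws=2000, keep=0.05, seed=0):
    """Rejection ABC: draw parameters from the prior, simulate, and keep the
    draws whose simulated summary is closest to the observed summary."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        r_ins = rng.uniform(0.0, 0.1)  # hypothetical uniform prior
        r_del = rng.uniform(0.0, 0.1)  # hypothetical uniform prior
        sim = simulate_summary(r_ins, r_del, n_sites, rng)
        # Euclidean distance between simulated and observed summaries
        dist = sum((s - o) ** 2 for s, o in zip(sim, observed)) ** 0.5
        draws.append((dist, r_ins, r_del))
    draws.sort()  # smallest distance first
    accepted = draws[: max(1, int(keep * n_draws))]
    mean_ins = sum(d[1] for d in accepted) / len(accepted)
    mean_del = sum(d[2] for d in accepted) / len(accepted)
    return mean_ins, mean_del

# Observed summary: 10 insertions and 30 deletions over 500 sites.
est_ins, est_del = abc_rejection(observed=(10, 30), n_sites=500)
```

The posterior means of the accepted draws serve as point estimates. Scatter plots like those in the figure are obtained by repeating such an inference over many simulated data sets and plotting each estimate against the value used for simulation.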
The 27 Summary Statistics Used in the ABC Scheme.
| No. | Summary Statistics |
|---|---|
| 1 | Total number of gap blocks in the alignment |
| 2 | Total number of unique gap blocks in the alignment |
| 3 | Average gap block length |
| 4 | Average unique gap block length |
| 5 | Number of gap blocks of length one |
| 6 | Number of gap blocks of length two |
| 7 | Number of gap blocks of length three |
| 8 | Number of gap blocks of length four or more |
| 9 | Alignment length |
| 10 | Minimum length of sequence in the alignment |
| 11 | Maximum length of sequence in the alignment |
| 12 | Number of MSA columns with zero gap |
| 13 | Number of MSA columns with one gap |
| 14 | Number of MSA columns with two gaps |
| 15 | Number of MSA columns with |
| 16 | Number of gaps of length one that appear only in one sequence |
| 17 | Number of gaps of length one that are shared between exactly two sequences |
| 18 | Number of gaps of length one that are shared between exactly |
| 19 | Number of gaps of length two that appear only in one sequence |
| 20 | Number of gaps of length two that are shared between exactly two sequences |
| 21 | Number of gaps of length two that are shared between exactly |
| 22 | Number of gaps of length three that appear only in one sequence |
| 23 | Number of gaps of length three that are shared between exactly two sequences |
| 24 | Number of gaps of length three that are shared between exactly |
| 25 | Number of gaps of length at least four that appear only in one sequence |
| 26 | Number of gaps of length at least four that are shared between exactly two sequences |
| 27 | Number of gaps of length at least four that are shared between exactly |
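Several of the statistics listed above can be computed directly from the rows of a multiple sequence alignment. The sketch below covers statistics 1 to 3 and 5 to 14 under stated assumptions: a gap block is taken to be a maximal run of `-` within one aligned sequence, and two blocks are counted as the same unique block when they share a start column and length. These definitions, and the function names `gap_blocks` and `summary_statistics`, are illustrative choices rather than the authors' implementation.

```python
import re
from collections import Counter

def gap_blocks(row):
    """Return (start, length) for each maximal run of '-' in one aligned row."""
    return [(m.start(), m.end() - m.start()) for m in re.finditer(r"-+", row)]

def summary_statistics(msa):
    """Compute a subset of the gap-based summary statistics for an MSA
    given as a list of equal-length strings."""
    blocks = [b for row in msa for b in gap_blocks(row)]
    lengths = [length for _, length in blocks]
    # Ungapped length of every sequence
    seq_lengths = [len(row) - row.count("-") for row in msa]
    # Number of columns containing exactly k gap characters
    gaps_per_column = Counter(col.count("-") for col in zip(*msa))
    return {
        "n_gap_blocks": len(blocks),                       # statistic 1
        "n_unique_gap_blocks": len(set(blocks)),           # statistic 2
        "avg_gap_block_length":                            # statistic 3
            sum(lengths) / len(lengths) if lengths else 0.0,
        "n_blocks_len_1": sum(n == 1 for n in lengths),    # statistic 5
        "n_blocks_len_2": sum(n == 2 for n in lengths),    # statistic 6
        "n_blocks_len_3": sum(n == 3 for n in lengths),    # statistic 7
        "n_blocks_len_4_plus": sum(n >= 4 for n in lengths),  # statistic 8
        "alignment_length": len(msa[0]),                   # statistic 9
        "min_seq_length": min(seq_lengths),                # statistic 10
        "max_seq_length": max(seq_lengths),                # statistic 11
        "cols_0_gaps": gaps_per_column[0],                 # statistic 12
        "cols_1_gap": gaps_per_column[1],                  # statistic 13
        "cols_2_gaps": gaps_per_column[2],                 # statistic 14
    }

# A small hand-made alignment used only for illustration.
msa = [
    "ACG--TAC",
    "ACG--TAC",
    "A-GTTTAC",
    "ACGTTT-C",
]
stats = summary_statistics(msa)
```

In this example the first two sequences share one gap block of length two, so the alignment has four gap blocks in total but only three unique ones.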
Model Selection Accuracy.
| Simulated Model | Accuracy |
|---|---|
| SIM | 0.98 |
| RIM | 0.77 |
Note.—Accuracy is computed based on 100 simulations for each indel model. For example, out of 100 MSAs simulated under RIM, the model selection approach correctly identified 77 as RIM.
Fig. 2. Misclassification rates depend on the similarity between insertion and deletion parameters. The errors depend on the absolute difference between the insertion and deletion rate parameters and on the difference between the insertion and deletion length distribution parameters. All simulations were under the RIM model. Dark points mark simulations that were correctly classified as RIM; light points mark cases that were misclassified as SIM. Results are based on 200 simulated alignments: 100 simulated with the RIM model and 100 simulated with the SIM model. Sequences were simulated based on the topology of the ENOG504B73R data set.
Model Selection for Various Taxonomical Groups.
| | Group | No. of SIM | No. of RIM | Percentage of RIM |
|---|---|---|---|---|
| Prokaryotes | | 147 | 120 | 44.94 |
| | | 106 | 41 | 27.89 |
| | | 185 | 88 | 32.23 |
| | | 189 | 81 | 30.00 |
| | | 309 | 85 | 21.57 |
| | Tenericutes | 144 | 141 | 49.47 |
| | Vibrionales | 263 | 144 | 35.38 |
| Eukaryotes | Brassicales | 219 | 43 | 16.41 |
| | Chlorophyta | 291 | 122 | 29.54 |
| | Ciliophora | 176 | 156 | 46.99 |
| | Drosophilidae | 245 | 228 | 48.20 |
| | Primates | 66 | 6 | 8.33 |
| | Rhabditida | 288 | 114 | 28.36 |
| | Rodentia | 201 | 36 | 15.19 |
| | Saccharomycetaceae | 285 | 304 | 51.61 |
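The percentage column follows from the two count columns as 100 × RIM / (SIM + RIM). The short check below recomputes it for all 15 rows, copied in table order, and sums the counts. The helper name `rim_percentage` is a hypothetical label used only here.

```python
# (n_sim, n_rim, reported_percentage_of_rim), copied row by row from the table
rows = [
    (147, 120, 44.94), (106, 41, 27.89), (185, 88, 32.23),
    (189, 81, 30.00), (309, 85, 21.57), (144, 141, 49.47),
    (263, 144, 35.38), (219, 43, 16.41), (291, 122, 29.54),
    (176, 156, 46.99), (245, 228, 48.20), (66, 6, 8.33),
    (288, 114, 28.36), (201, 36, 15.19), (285, 304, 51.61),
]

def rim_percentage(n_sim, n_rim):
    """Share of data sets for which RIM was selected, in percent."""
    return 100.0 * n_rim / (n_sim + n_rim)

recomputed = [rim_percentage(n_sim, n_rim) for n_sim, n_rim, _ in rows]
total_sim = sum(n_sim for n_sim, _, _ in rows)
total_rim = sum(n_rim for _, n_rim, _ in rows)
overall_rim_percentage = rim_percentage(total_sim, total_rim)
```

The RIM counts sum to 1,709, which matches the number of protein data sets reported for the RIM analysis in the caption of Fig. 3.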
Model Parameters across Various Taxonomical Groups for Protein Data Sets Classified as (a) RIM and (b) SIM.
(a) RIM
| Group | Root Length | Insertion Rate | Deletion Rate | Mean Insertion Length | Mean Deletion Length |
|---|---|---|---|---|---|
| | 630.7 | 0.0142 | 0.0216 | 6.74 | 7.65 |
| | 425.0 | 0.0116 | 0.0201 | 5.64 | 6.40 |
| | 771.9 | 0.0101 | 0.0185 | 6.47 | 6.72 |
| | 751.4 | 0.0101 | 0.0152 | 6.49 | 6.59 |
| | 591.6 | 0.0113 | 0.0180 | 6.34 | 7.26 |
| Tenericutes | 788.3 | 0.0103 | 0.0160 | 6.19 | 6.17 |
| Vibrionales | 661.2 | 0.0123 | 0.0180 | 6.49 | 7.36 |
| Brassicales | 1395.6 | 0.0177 | 0.0364 | 7.14 | 8.70 |
| Chlorophyta | 800.7 | 0.0206 | 0.0266 | 8.29 | 8.30 |
| Ciliophora | 905.6 | 0.0198 | 0.0248 | 7.98 | 7.95 |
| Drosophilidae | 1826.7 | 0.0141 | 0.0390 | 6.34 | 7.87 |
| Primates | 1376.2 | 0.0193 | 0.0344 | 7.01 | 7.49 |
| Rhabditida | 799.2 | 0.0235 | 0.0345 | 7.59 | 7.93 |
| Rodentia | 1113.1 | 0.0154 | 0.0374 | 6.59 | 8.70 |
| Saccharomycetaceae | 869.5 | 0.0084 | 0.0103 | 6.64 | 5.51 |
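To make the rate asymmetry in the table above concrete, the sketch below computes a deletion-to-insertion rate ratio for the groups whose names survive in the table. It assumes that the two rate columns are, in order, the insertion rate and the deletion rate, matching the order of the two length columns; the variable names are illustrative.

```python
# group -> (insertion rate, deletion rate), copied from the table above.
# Assumption: the first rate column is the insertion rate and the second
# is the deletion rate.
rates = {
    "Tenericutes": (0.0103, 0.0160),
    "Vibrionales": (0.0123, 0.0180),
    "Brassicales": (0.0177, 0.0364),
    "Chlorophyta": (0.0206, 0.0266),
    "Ciliophora": (0.0198, 0.0248),
    "Drosophilidae": (0.0141, 0.0390),
    "Primates": (0.0193, 0.0344),
    "Rhabditida": (0.0235, 0.0345),
    "Rodentia": (0.0154, 0.0374),
    "Saccharomycetaceae": (0.0084, 0.0103),
}

# Ratio above 1 means deletions occur at a higher rate than insertions.
ratios = {group: r_del / r_ins for group, (r_ins, r_del) in rates.items()}
highest = max(ratios, key=ratios.get)
lowest = min(ratios, key=ratios.get)
```

Under that assumption every listed group has a ratio above 1, with Drosophilidae the most deletion-biased and Saccharomycetaceae the least.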
Fig. 3. Deletion rates are mostly higher than insertion rates, whereas no significant trend is found for the length distribution. Left panel: a scatter plot of insertion rate versus deletion rate. Right panel: a scatter plot of mean insertion length versus mean deletion length. The analysis includes a total of 1,709 protein data sets, across 15 taxonomic groups, for which the RIM model was selected. The dashed line is the identity line in both panels.
Fig. 4. Indel dynamics for empirical DNA sequences. Left panel: a scatter plot of insertion rate versus deletion rate. Right panel: a scatter plot of mean insertion length versus mean deletion length. For the 81 yeast intron data sets analyzed, the number of species ranged from 4 to 7. Shown are results for the 59% of data sets for which the RIM model was selected. The dashed line is the identity line in both panels.
(b) SIM
| Group | Root Length | Indel Rate | Mean Indel Length |
|---|---|---|---|
| | 605.5 | 0.0468 | 9.17 |
| | 492.1 | 0.0251 | 7.26 |
| | 776.0 | 0.0187 | 7.20 |
| | 766.0 | 0.0221 | 7.88 |
| | 570.2 | 0.0157 | 7.25 |
| Tenericutes | 692.9 | 0.0328 | 7.89 |
| Vibrionales | 658.2 | 0.0287 | 8.41 |
| Brassicales | 795.0 | 0.0493 | 9.35 |
| Chlorophyta | 677.2 | 0.0617 | 10.10 |
| Ciliophora | 650.0 | 0.0629 | 9.89 |
| Drosophilidae | 1428.0 | 0.0581 | 8.67 |
| Primates | 1091.4 | 0.0384 | 8.80 |
| Rhabditida | 578.6 | 0.0557 | 9.39 |
| Rodentia | 834.3 | 0.0512 | 9.39 |
| Saccharomycetaceae | 1090.2 | 0.0127 | 6.54 |