Chaitanya Aluru, Mona Singh.
Abstract
MOTIVATION: Protein domain duplications are a major contributor to the functional diversification of protein families. These duplications can occur one at a time through single domain duplications, or as tandem duplications where several consecutive domains are duplicated together as part of a single evolutionary event. Existing methods for inferring domain-level evolutionary events are based on reconciling domain trees with gene trees. While some formulations consider multiple domain duplications, they do not explicitly model tandem duplications; this leads to inaccurate inference of which domains duplicated together over the course of evolution.
Year: 2021 PMID: 34252920 PMCID: PMC8275333 DOI: 10.1093/bioinformatics/btab329
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Two possible duplication patterns and their corresponding phylogenetic trees. (a) A protein has two domain instances, depicted by a triangle and star, labeled ‘A’ and ‘B’, respectively. Here, a single tandem duplication occurs, and the entire stretch containing both domains is duplicated at once. The resulting protein contains two instances each of the star and triangle domains. Without loss of generality, after duplication we refer to the first copy of the two domains in the second sequence as corresponding to the ancestral domains and thus refer to them by ‘A’ and ‘B’. The domains in the second copy are considered new domains and are referred to by the new labels ‘C’ and ‘D’. (b) A different set of duplication events with the same starting protein. Here, we see two individual domain duplications; first the triangle domain duplicates and then the star domain duplicates. In the end, the protein has two triangle and two star domains, as in part (a). Again, after duplication the first copy of each of these two domains is assumed to be the ancestral one. (c and d) Phylogenetic trees built from the domains found in the final proteins from parts (a) and (b). The leaf labels correspond to the name of the domain, with position in the sequence marked below. Each ancestral node is labeled with the name of its leftmost child (as by design this is assumed to be the original copy after the duplication). The positions of ancestral nodes are unknown and must be inferred. Note that the topologies of the trees are identical, so we cannot distinguish between tandem duplication and two individual duplication events from tree topology alone. However, by including the relative position of each domain in the tree, we see that the leaves in (c) are interleaved, indicating a tandem duplication, while the leaves in (d) are not, indicating that two separate duplication events occurred.
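The positional argument in this caption can be illustrated with a small sketch. It is not the paper's ILP or heuristic; it is a simplified check under the assumption that a tandem duplication copies a block of consecutive domains and places the copies immediately after the originals, with no later rearrangement. The function name `consistent_with_tandem` and the `(original_position, copy_position)` encoding are hypothetical, chosen only for illustration.

```python
def consistent_with_tandem(pairs):
    """Check whether duplication nodes could stem from one tandem event.

    pairs: list of (original_position, copy_position), one per duplication
    node in the domain tree. Assumption (illustrative only): a tandem
    duplication copies k consecutive domains and inserts the k copies
    directly after the originals.
    """
    if len(pairs) < 2:
        return False  # a single pair is just a single domain duplication
    pairs = sorted(pairs)
    originals = [o for o, _ in pairs]
    copies = [c for _, c in pairs]
    # Originals must form one contiguous block ...
    originals_consecutive = all(b - a == 1 for a, b in zip(originals, originals[1:]))
    # ... the copies must form one contiguous block in the same order ...
    copies_consecutive = all(b - a == 1 for a, b in zip(copies, copies[1:]))
    # ... and the copied block must start right after the original block.
    copies_follow = copies[0] == originals[-1] + 1
    return originals_consecutive and copies_consecutive and copies_follow


# Fig. 1(c): triangle pair at positions (1, 3), star pair at (2, 4).
fig1_tandem = [(1, 3), (2, 4)]
# Fig. 1(d): triangle pair at positions (1, 2), star pair at (3, 4).
fig1_separate = [(1, 2), (3, 4)]

print(consistent_with_tandem(fig1_tandem))
print(consistent_with_tandem(fig1_separate))
```

In the tandem case the leaf positions of the two subtrees interleave (1, 3 versus 2, 4), so the check succeeds; in the other case each subtree occupies its own stretch of the protein and the check fails, even though both trees share the same topology.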
Accuracy of mappings obtained by the LCA method, our full ILP and MultRec
| Method / event distance | 0.010 | 0.025 | 0.050 | 0.075 | 0.100 |
|---|---|---|---|---|---|
| LCA | 0.993 | 1.000 | 1.000 | 1.000 | 1.000 |
| Full ILP | - | 0.999 | 1.000 | 1.000 | 1.000 |
| MultRec | 0.990 | 0.997 | 1.000 | 1.000 | 1.000 |
Note: For each event distance, we give the average mapping accuracy across simulations for the mappings obtained by each of the three methods. All three methods have nearly perfect performance in all instances, indicating that in most cases, using mappings from the fastest method (LCA) is sufficient. We omit results for the ILP at an event distance of 0.01 because it was unable to run in a reasonable time.
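For readers unfamiliar with LCA mapping, the following is a minimal sketch of the standard lowest-common-ancestor reconciliation idea: each domain-tree node is mapped to the lowest gene-tree node that is an ancestor of every gene containing one of its leaf domains. Whether the paper's LCA method matches this in every detail is an assumption. The tree encodings (nested tuples for the domain tree, a child-to-parent dictionary for the gene tree) and all names are hypothetical.

```python
def ancestors(node, parent):
    """Return the path from node up to the root (node itself included)."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lca(u, v, parent):
    """Lowest common ancestor of u and v in a tree given as child -> parent."""
    seen = set(ancestors(u, parent))
    for node in ancestors(v, parent):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")


def lca_mapping(domain_tree, leaf_to_gene, gene_parent, mapping=None):
    """Map every domain-tree node to a gene-tree node by post-order LCA.

    domain_tree: a leaf name (str) or a pair (left_subtree, right_subtree).
    leaf_to_gene: domain leaf name -> name of the gene that contains it.
    """
    if mapping is None:
        mapping = {}
    if isinstance(domain_tree, str):
        mapping[domain_tree] = leaf_to_gene[domain_tree]
    else:
        left, right = domain_tree
        lca_mapping(left, leaf_to_gene, gene_parent, mapping)
        lca_mapping(right, leaf_to_gene, gene_parent, mapping)
        mapping[domain_tree] = lca(mapping[left], mapping[right], gene_parent)
    return mapping


# Toy gene tree: root has two child genes, g1 and g2.
gene_parent = {"g1": "root", "g2": "root"}
# Toy domain tree: domains A and C both sit in g1; another copy of A sits in g2.
leaf_to_gene = {"A_g1": "g1", "C_g1": "g1", "A_g2": "g2"}
domain_tree = (("A_g1", "C_g1"), "A_g2")

mapping = lca_mapping(domain_tree, leaf_to_gene, gene_parent)
print(mapping[("A_g1", "C_g1")])  # maps inside g1: a lineage-specific duplication
print(mapping[domain_tree])       # maps to the gene-tree root
```

The quadratic ancestor walk keeps the sketch short; a production implementation would typically precompute LCAs for large trees.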
Fig. 2. Performance of methods in inferring tandem domain duplications. Average precision (left), recall (middle) and F1-scores (right) of MultRec (blue), our heuristic (orange) and our full ILP (green) when run on 50 simulations each at the five given event distances. Error bars depict one standard deviation around the mean. At high event distances, all three methods have excellent precision, recall and F1-scores. At lower event distances, the heuristic and ILP have significantly higher precision than MultRec but slightly lower recall. The heuristic and ILP also have significantly higher F1-scores than MultRec at low event distances. Because of its runtime, we do not report results for the ILP at an event distance of 0.01; however, the heuristic achieves perfect scores at high event distances and is only slightly worse than the ILP at an event distance of 0.025.
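As a reminder of how the three reported metrics relate, here is a small sketch that scores a set of inferred tandem duplication events against a set of true events. The matching rule is an assumption: an inferred event counts as correct only if its set of domains exactly equals that of a true event. The paper's exact evaluation criterion is not given in this record, and the function and variable names are hypothetical.

```python
def precision_recall_f1(inferred, true):
    """Score inferred events against true events using exact set matches.

    Each event is a frozenset of the domains duplicated together
    (an illustrative encoding, not taken from the paper).
    """
    inferred, true = set(inferred), set(true)
    true_positives = len(inferred & true)
    precision = true_positives / len(inferred) if inferred else 0.0
    recall = true_positives / len(true) if true else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: two true tandem events, two inferred events, one of them correct.
true_events = {frozenset({"A", "B"}), frozenset({"E", "F", "G"})}
inferred_events = {frozenset({"A", "B"}), frozenset({"C", "D"})}

print(precision_recall_f1(inferred_events, true_events))  # (0.5, 0.5, 0.5)
```

A method that reports many spurious tandem events lowers precision while leaving recall untouched, which is the pattern the caption describes for MultRec at low event distances.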
Average runtime in seconds of our heuristic, our full ILP and the MultRec program
| Method / event distance | 0.010 | 0.025 | 0.050 | 0.075 | 0.100 |
|---|---|---|---|---|---|
| Heuristic | 4.041 | 1.369 | 0.48 | 0.13 | 0.062 |
| Full ILP | - | 282.734 | 35.214 | 1.706 | 0.333 |
| MultRec | 0.975 | 0.185 | 0.041 | 0.018 | 0.012 |
Note: For each event distance, the average runtime in seconds across 50 simulations is reported. Our heuristic scales nearly as well as MultRec while providing significantly better tandem duplication inference. In contrast, at event distance 0.01, the ILP did not finish running on the 50 simulations in under 24 h.
Fig. 3. Lineage-specific filamin domain expansions in Filamin-A proteins. Shown are clades of domains whose leaves consist of domains found only in (a) Hirudo medicinalis or (b) Trichoplax adhaerens. These subtrees of the full domain tree represent lineage-specific expansions in each species. The number labeling each leaf refers to the relative position of each domain in the constituent protein. Gray rectangles specify the tandem duplication events inferred by our heuristic approach. All internal nodes not covered by these gray rectangles correspond to single domain duplications. Branch lengths are not representative. In H. medicinalis, we see a series of single duplications, with one small tandem duplication in the middle. On the other hand, T. adhaerens exhibits a very clear pattern of tandem duplications. Models that attempt to minimize the number of tandem duplications without considering domain positions would be mostly correct for (b), but not for (a). With our model, we can accurately infer duplication events in both scenarios.