Xiaoyue Cui1, Yifan Xue1,2, Collin McCormack1,2, Alejandro Garces2, Thomas W Rachman1, Yang Yi1,2, Maureen Stolzer2, Dannie Durand2.
Abstract
MOTIVATION: Simulation is an essential technique for generating biomolecular data with a 'known' history for use in validating phylogenetic inference and other evolutionary methods. On longer time scales, simulation supports investigations of equilibrium behavior and provides a formal framework for testing competing evolutionary hypotheses. Twenty years of molecular evolution research have produced a rich repertoire of simulation methods. However, current models do not capture the stringent constraints acting on the domain insertions, duplications, and deletions by which multidomain architectures evolve. Although these processes have the potential to generate any combination of domains, only a tiny fraction of possible domain combinations are observed in nature. Modeling these stringent constraints on domain order and co-occurrence is a fundamental challenge in domain architecture simulation that does not arise with sequence and gene family simulation.
Year: 2022 PMID: 35758772 PMCID: PMC9236583 DOI: 10.1093/bioinformatics/btac242
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Fig. 1. Example multidomain protein: proto-oncogene tyrosine-protein kinase Src in human. (a) Domains in the sequence are identified by PFAM (Mistry ) HMMs: SH3 (PF00018), SH2 (PF00017), and a protein tyrosine kinase (PF07714). Sequence in linker regions represented as (). (b) The 3D structure of Src, with the SH2, SH3, and kinase folds shown in purple, red, and blue. (c) Src domain architecture, showing its constituent domains in N- to C-terminal order. (d) A sequence LOGO for the PFAM SH3 domain model.
Fig. 2. Schematic showing changes in domain architecture via insertion, duplication, and deletion of domains.
Fig. 3. The state transition diagram showing states adjacent to a DA of length n = 3. Each stack of circles on the right represents the N states that can be reached by a domain gain at the associated position.
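The neighborhood described in the Fig. 3 caption can be illustrated with a minimal sketch. It assumes (this is not stated in the caption) that a gain may occur at any of the n + 1 positions of a DA of length n, that a loss removes any one of its n domains, and that DAs are represented as tuples of domain identifiers. The function name `adjacent_states` and the placeholder domain `"X"` are hypothetical.

```python
from typing import List, Sequence, Tuple

# A domain architecture (DA) is modeled as a tuple of domain identifiers
# in N- to C-terminal order. This representation is an assumption.
DA = Tuple[str, ...]


def adjacent_states(da: DA, domains: Sequence[str]) -> List[DA]:
    """List the architectures one gain or one loss away from `da`.

    Assumes n + 1 gain positions and n loss positions for a DA of length n.
    The same architecture can appear more than once, because different
    events (for example, inserting a copy before or after an identical
    domain) may lead to the same state.
    """
    neighbors: List[DA] = []
    # Gains: each of the N domains can be inserted at each position.
    for pos in range(len(da) + 1):
        for domain in domains:
            neighbors.append(da[:pos] + (domain,) + da[pos:])
    # Losses: remove the domain at each position in turn.
    for pos in range(len(da)):
        neighbors.append(da[:pos] + da[pos + 1:])
    return neighbors


# Src-like architecture from Fig. 1, using the PFAM accessions in the caption.
src = ("PF00018", "PF00017", "PF07714")
# "X" is a hypothetical extra domain, used only for illustration.
vocabulary = ["PF00018", "PF00017", "PF07714", "X"]
events = adjacent_states(src, vocabulary)
```

With n = 3 and N = 4 this enumerates 4 × 4 gain events and 3 loss events. In a sampler, each candidate would then be accepted or rejected according to the model's acceptance rule, which is not reproduced here.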
Summary statistics for four training datasets
| | Primates | Fish | Fly | Cnidaria |
|---|---|---|---|---|
| Unique DAs | 7144 | 8985 | 5483 | 6271 |
| Domains | 1132 | 1131 | 1017 | 1159 |
| Unique bigrams | 2865 | 3357 | 2884 | 3533 |
| Mean DA length | 5.4 | 6.2 | 5.2 | 3.6 |
| Median DA length | 3 | 4 | 3 | 3 |
Fig. 4. MCMC convergence assessment. (a) Gelman-Rubin diagnostic applied to DA lengths sampled every 100 iterations with the primate dataset. (b) Event acceptance rate.
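For readers unfamiliar with the diagnostic in Fig. 4(a), the following is a minimal sketch of the classic Gelman-Rubin potential scale reduction factor, applied to one scalar summary per sample (here, DA length). Which variant of the diagnostic the authors used is not stated in this record, so the formula and the function name `gelman_rubin` should be read as assumptions.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def gelman_rubin(chains: Sequence[Sequence[float]]) -> float:
    """Classic potential scale reduction factor for equal-length chains.

    Values near 1 suggest the chains are sampling the same distribution.
    Raises ZeroDivisionError if every chain is constant.
    """
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least 2 chains of equal length >= 2")
    # W: mean of the within-chain sample variances.
    w = mean([variance(c) for c in chains])
    # B: n times the sample variance of the chain means.
    b = n * variance([mean(c) for c in chains])
    # Pooled estimate of the target variance.
    var_hat = (n - 1) / n * w + b / n
    return sqrt(var_hat / w)


# Toy DA-length traces (illustrative numbers, not data from the paper).
agreeing = [[3, 4, 5, 4], [4, 3, 4, 5]]
separated = [[1, 2, 3, 4], [11, 12, 13, 14]]
r_agree = gelman_rubin(agreeing)
r_separated = gelman_rubin(separated)
```

Chains that explore different regions, as in `separated`, give a value well above 1, while chains with matching means give a value close to or slightly below 1 for short traces.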
Fig. 5. Final DA length as a function of chain length (horizontal axis not to scale). Top panels: Close-up view of the same distribution. Mean DA length shown as solid dots. Horizontal line represents mean length of genuine DAs. Length distributions of genuine DAs are plotted in the rightmost columns. (a) Primates. (b) Fish. (c) Drosophila. (d) Cnidaria.
Domain combination statistics in genuine and simulated DAs () are highly correlated (Pearson correlation coefficient, for all tests)
| | Primate | Fish | Drosophila | Cnidaria |
|---|---|---|---|---|
| Singleton frequency | 0.998 | 0.999 | 0.994 | 0.994 |
| Bigram frequency | 0.997 | 0.998 | 0.994 | 0.988 |
| Trigram frequency | 0.969 | 0.968 | 0.986 | 0.962 |
| Pair co-occurrence | 0.868 | 0.832 | 0.944 | 0.888 |
| Unique neighbors | 0.925 | 0.922 | 0.918 | 0.946 |
| Wtd bigram promiscuity | 0.843 | 0.870 | 0.925 | 0.821 |
| Mean tandem array length | 0.889 | 0.927 | 0.935 | 0.942 |
| DA probability quantiles | 0.976 | 0.973 | 0.964 | 0.984 |
Fig. 6. Frequency of accepted gain (left axis) and loss (right axis) positions; bars for gains (blue) and losses (red) are interleaved, starting with gains.
The 10 domains with the most copies in tandem arrays
| Superfamily | Genuine: Total | Genuine: Max | Simulated: Total | Simulated: Max |
|---|---|---|---|---|
| Immunoglobulin | 2752 | 106 | 2968 | 22 |
| EGF/Laminin | 2085 | 19 | 2142 | 11 |
| Spectrin repeat | 2039 | 49 | 1907 | 48 |
| Fibronectin Type III | 1758 | 29 | 1779 | 16 |
| LDL receptor-like module | 997 | 12 | 1219 | 23 |
| ARM repeat | 858 | 13 | 775 | 12 |
| Beta-beta-alpha zinc fingers | 826 | 23 | 768 | 25 |
| Cadherin-like | 819 | 34 | 857 | 78 |
| Complement control module/SCR domain | 811 | 37 | 1025 | 14 |
| Growth factor receptor domain | 696 | 9 | 711 | 7 |
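The "Total" and "Max" columns can be illustrated with a short sketch. It assumes (the record does not define these terms) that a tandem array is a run of at least two consecutive copies of the same domain, that "Total" sums the copies in all such runs, and that "Max" is the longest single run. The function name `tandem_array_stats` and the `min_len` parameter are hypothetical.

```python
from collections import defaultdict
from itertools import groupby
from typing import Dict, Iterable, Sequence, Tuple


def tandem_array_stats(
    das: Iterable[Sequence[str]], min_len: int = 2
) -> Dict[str, Tuple[int, int]]:
    """Return {domain: (total copies in tandem arrays, longest array)}."""
    total: Dict[str, int] = defaultdict(int)
    longest: Dict[str, int] = defaultdict(int)
    for da in das:
        # groupby yields maximal runs of consecutive identical domains.
        for domain, run in groupby(da):
            length = sum(1 for _ in run)
            if length >= min_len:
                total[domain] += length
                longest[domain] = max(longest[domain], length)
    return {d: (total[d], longest[d]) for d in total}


# Toy architectures (illustrative only, not data from the paper).
example = [
    ("Ig", "Ig", "Ig", "FN3"),
    ("FN3", "FN3", "Ig", "Ig"),
]
stats = tandem_array_stats(example)
```

Isolated single copies, such as the lone `FN3` in the first toy architecture, are excluded under this definition; lowering `min_len` to 1 would count every occurrence instead.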