Vassily A Lyubetsky, Lev I Rubanov, Leonid Y Rusin, Konstantin Yu Gorbunov.
Abstract
BACKGROUND: A long-recognized problem is the inference of the supertree S that amalgamates a given set {G(j)} of trees G(j), with the leaves in each G(j) assigned homologous elements. We build on an approach that finds the tree S by minimizing the total cost of the mappings α(j) of the individual gene trees G(j) into S. Traditionally, this cost is defined as the sum of duplications and gaps in each α(j). The classical problem is to minimize the total cost, where S runs over the set of all trees that contain an exhaustive non-redundant set of species from all input G(j).
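In symbols, the classical problem described above can be sketched as follows. The counts n_dupl and n_gap and the set T of admissible trees are notation introduced here for illustration, not taken from the paper.

```latex
S^{*} \;=\; \arg\min_{S \in \mathcal{T}} \; \sum_{j} c\bigl(G_j, S\bigr),
\qquad
c\bigl(G_j, S\bigr) \;=\; n_{\mathrm{dupl}}(\alpha_j) + n_{\mathrm{gap}}(\alpha_j)
```

Here T stands for the set of all trees whose leaves are exactly the exhaustive non-redundant set of species found in the input trees G(j).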
Year: 2012 PMID: 23259766 PMCID: PMC3577452 DOI: 10.1186/1745-6150-7-48
Source DB: PubMed Journal: Biol Direct ISSN: 1745-6150 Impact factor: 4.540
Figure 1. A transition from tree S to tree S0. Leaves 1, 2, 3 contain in-group species; leaf d* contains an auxiliary outgroup species. Leaf d* is connected to the root by the outgroup tube (shown in bold). All tubes acquire additional vertices during the transition to S0 (right, shown in bold) to delimit time slices (here four slices are separated by dashed lines). Each slice thus contains one segment of the outgroup tube in S (left), and these segments form the outgroup tubes in S0 (each shown in bold). Any such segment, as well as the outgroup leaf species, is denoted d*. The root tube d0 is attached to the root, by analogy with the edge e0 in a gene tree G.
Table 1. Types of evolutionary events and their costs
| 0 | cohered leaf edge | evolution of gene | ||
| 1 | non-cohered leaf edge | gene | ||
| 2 | same as #1 but | gene | ||
| 3 | tube | gene | ||
| 4 | edge | |||
| 5 | same as #4 | |||
| 6 | gene | |||
| 7 | same as #6 | gene | ||
| 8 | gene | |||
| 9 | same as #8 | |||
| 10 | gene | |||
| 11 | same as #10 | |||
| 12 | edge | gene | ||
| 13 | same as #12 but | one of the descendants of | ||
| 14 | edge | gene | ||
| 15 | edge | one copy | ||
| 16 | same as #15 | one copy | ||
| 17 | edge | gene | ||
| 18 | same as #17 | gene | ||
| 19 | gene | |||
| 20 | gene | |||
| 21 | gene | |||
| 22 | gene | |||
| 23 | edge | gene | ||
| 24 | same as #23 | gene | ||
| 25 | gene | |||
| 26 | same as #25 | gene | ||
| 27 | gene | |||
| 28 | same as #27 | gene | ||
| 29 | gene | |||
| 30 | same as #29 | gene | ||
| 31 | edge | gene | ||
| 32 | edge | gene | ||
| 33 | edge | gene | ||
| 34 | gene |
Consider i as the number of the event (and the row number) in a fixed enumeration pattern; “Condition” defines the applicability of the event to current pair
Figure 2. An additional “root” edge between the “super-root” and the initial root in the gene tree G. This root is used to define the set R(V), since the vertex R' can be good. The root edge e0 is analogous to the root tube d0 in Figure 1.
Figure 3. Tree S(V) for a fixed partition V = V1 + V2. Here V, V1, V2 designate both the corresponding vertex and the edge upwards from this vertex. Trees S(V1) and S(V2), as well as their costs c(V1) and c(V2), are already known by induction, and C + C corresponds to evolutionary events in the edges V, V1, V2. Those are not known by induction and should be computed separately, as defined below in the text.
Figure 4. Two possibilities of inserting a new vertex connected by an edge with a new species.
Figure 5. The inductive step in computing the cost c(G,S). On the left is an illustration of assembling the tree G from subtrees G1 and G2. Here e1 is the root edge in G1, e2 is the root edge in G2 (the root edges belong to their corresponding trees), and e is the root edge in G. The figure on the right illustrates an embedding of G into S. Costs c(G1,S) and c(G2,S) are computed by induction. Then c(G,S) = c(G1,S) + c(G2,S) + x, where x = c(loss) + c(transfer) + c(dupl), and the summands are parameters of the algorithm (elementary event costs).
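The inductive step in this caption can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `GeneTree` class, the `local_events` callback (which stands in for the embedding of G into S) and the zero cost of a leaf are all hypothetical, and c(tr_with) = 17.6 is used as a stand-in for the generic c(transfer).

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Elementary event costs (parameters of the algorithm). The values are the
# ones quoted with the tables; using c(tr_with) for "transfer" is an assumption.
COSTS = {"loss": 2.0, "transfer": 17.6, "dupl": 3.0}


@dataclass
class GeneTree:
    """A binary gene tree; a leaf has no children (hypothetical structure)."""
    name: str
    left: Optional["GeneTree"] = None
    right: Optional["GeneTree"] = None


def mapping_cost(g: GeneTree,
                 local_events: Callable[[GeneTree], Dict[str, int]]) -> float:
    """c(G,S) = c(G1,S) + c(G2,S) + x, where x prices the events on the root edge."""
    if g.left is None or g.right is None:
        return 0.0  # assumption: a leaf adds no cost in this sketch
    counts = local_events(g)  # hypothetical: event counts on the root edge of g
    x = sum(COSTS[kind] * n for kind, n in counts.items())
    return mapping_cost(g.left, local_events) + mapping_cost(g.right, local_events) + x


# Usage: tree ((a,b),c) with made-up event counts on its two internal edges.
tree = GeneTree("root", GeneTree("ab", GeneTree("a"), GeneTree("b")), GeneTree("c"))
events = {"root": {"loss": 1, "transfer": 1}, "ab": {"dupl": 1}}
total = mapping_cost(tree, lambda g: events.get(g.name, {}))
```

The species tree S does not appear explicitly: in this sketch it is hidden inside `local_events`, which is where a real embedding would decide which events occur on each edge.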
Figure 6. An example of () value. Edge e may cross different time slices (not shown); it undergoes several speciation events with a loss and two horizontal transfers without retention and, importantly, terminates with a duplication event.
Table 2. Definitions of events in the second scenario design (in the DAG)
| 0 | Edge is not projected (induction ends) | None |
| 1 | < | < |
| 2 | same as #1 | < |
| 3 | < | None |
| 4 | < | None |
| 5 | < | None |
| 6 | < | < |
| 7 | < | < |
| 8 | < | None |
| 9 | < | None |
| 10 | < | None |
| 11 | None | |
| 12 | < | < |
| 13 | < | None |
| 14 | < | None |
| 15 | < | < |
| 16 | < | < |
| 17 | < | < |
| 18 | < | < |
| 19 | < | < |
| 20 | < | |
| 21 | < | |
| 22 | < | < |
| 23 | < | < |
| 24 | < | < |
| 25 | < | < |
| 26 | < | < |
| 27 | < | < |
| 28 | < | |
| 29 | < | < |
| 30 | < | < |
| 31 | < | < |
| 32 | < | < |
| 33 | < | < |
| 34 | < | < |
Consider i the ordering of events specified in Table 1; the second column specifies the termini of the edge projected from the pair
Table 3. Comparison of Super3GL with RFsupertrees and CLANN version 3.0.2
| | Super3GL | RFsupertrees | CLANN |
|---|---|---|---|
| Artificial data (Additional file | | | |
| Supertree | Figure | Additional file | Additional file |
| Total cost of the supertree | 97443 | 114028 | 158751 |
| Cost of the second scenario | 151630 | 173527 | 218958 |
| Running time | 21 m | 10 m | 847 m |
| Biological data (Additional file | | | |
| Supertree | Figure | Additional file | Additional file |
| Total cost of the supertree | 210917 | 234880 | 234933 |
| Cost of the second scenario | 535524 | 660021 | 706826 |
| Running time | 14 m | 107 m | 2234 m |
The total cost of the supertree and the cost of the second evolutionary scenario are given by c({G}, S*) and formula (6), respectively. Individual event costs are as follows: c(dupl) = 3, c(loss) = 2, c(gain) = 12, c(gain_big) = 10, c(sleep) = 20, c(tr_with) = 17.6, c(tr_without) = 19.6.
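As a rough illustration of how these elementary costs turn event counts into a scenario cost, the following Python sketch prices a dictionary of counts. The function name `scenario_cost` and the example counts are hypothetical; only the seven cost values come from the note above.

```python
# Elementary event costs as listed in the table note.
EVENT_COSTS = {
    "dupl": 3.0,
    "loss": 2.0,
    "gain": 12.0,
    "gain_big": 10.0,
    "sleep": 20.0,
    "tr_with": 17.6,
    "tr_without": 19.6,
}


def scenario_cost(counts):
    """Sum of (number of events) x (cost per event) over all event types.

    `counts` maps an event name to how many times it occurs; event types
    that are absent from the mapping are treated as occurring zero times.
    """
    unknown = set(counts) - set(EVENT_COSTS)
    if unknown:
        raise KeyError(f"unknown event types: {sorted(unknown)}")
    return sum(EVENT_COSTS[kind] * n for kind, n in counts.items())


# Hypothetical counts, chosen only to show the arithmetic.
example = scenario_cost({"dupl": 10, "loss": 4, "tr_with": 1})
```

Keeping the costs in one mapping makes it easy to rerun the same counts under a different parameter choice.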
Figure 7. The artificial species tree S* used to simulate sets {G} of gene trees (40 species). The tree root is denoted by R. One of the simulated sets {G} is presented in Additional file 3. The Super3GL program applied to {G} reconstructed the known supertree S* in 95% of cases. The total mapping cost equals 97443. Leaf notations: Archaea: Archaeoglobus fulgidus (Afu), Halobacterium sp. NRC-1 (Hbs), Methanococcus jannaschii (Mja), Methanobacterium thermoautotrophicum (Mth), Thermoplasma acidophilum (Tac), Thermoplasma volcanium (Tvo), Pyrococcus horikoshii (Pho), Pyrococcus abyssi (Pab), Aeropyrum pernix (Ape), Sulfolobus solfataricus (Sso); Gram-positive bacteria: Streptococcus pyogenes (Spy), Bacillus subtilis (Bsu), Bacillus halodurans (Bha), Lactococcus lactis (Lla), Staphylococcus aureus (Sau), Ureaplasma urealyticum (Uur), Mycoplasma pneumoniae (Mpn), Mycoplasma genitalium (Mge); α-Proteobacteria: Mesorhizobium loti (Mlo), Caulobacter crescentus (Ccr), Rickettsia prowazekii (Rpr); β-Proteobacteria: Neisseria meningitidis MC58 (Nme); γ-Proteobacteria: Escherichia coli K12 (Eco), Buchnera sp. APS (Buc), Pseudomonas aeruginosa (Pae), Vibrio cholerae (Vch), Haemophilus influenzae (Hin), Pasteurella multocida (Pmu), Xylella fastidiosa (Xfa); ε-Proteobacteria: Helicobacter pylori (Hpy), Campylobacter jejuni (Cje); Chlamydia: Chlamydia trachomatis (Ctr), Chlamydia pneumoniae (Cpn); Spirochaetes: Treponema pallidum (Tpa), Borrelia burgdorferi (Bbu); others: Deinococcus radiodurans (Dra), Mycobacterium tuberculosis (Mtu), Synechocystis (Syn), Aquifex aeolicus (Aae), Thermotoga maritima (Tma).
Figure 8. The supertree built by Super3GL for biological data from Additional file 4. The tree root is denoted by R. Numbers indicate the phylogenetic patterns mentioned in “Testing of the algorithms”.
Table 4. Example characteristics of the first and second scenario designs
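| Characteristic | Artificial data, first design | Artificial data, second design | Biological data, first design | Biological data, second design |
|---|---|---|---|---|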
| Total cost / expectation | 97443.4 | 151629.7 | 210917.0 | 535524.0 |
| Total cost / expectation of gains | 60.0 | 358.4 | 53448.0 | 77040.5 |
| Total cost / expectation of losses | 38024.0 | 56660.0 | 98376.0 | 187600.5 |
| Total cost / expectation of duplications | 26796.0 | 34324.6 | 38286.0 | 44639.6 |
| Total cost / expectation of transfers | 32563.4 | 60168.3 | 17887.0 | 223854.8 |
| Total cost / expectation of the gain_big events | 0.0 | 118.4 | 2920.0 | 2388.6 |
| Running time | <1m | 2m | 15m | 41m |
Input tree data are the same as for Table 3. The tree S is obtained by the supertree building algorithm described in the paper. The degree of ramification is k = 10. Individual event costs are as follows: c(dupl) = 3, c(loss) = 2, c(gain) = 12, c(gain_big) = 10, c(sleep) = 20, c(tr_with) = 17.6, c(tr_without) = 19.6. The running time is specified for parallel computations on a 16-CPU platform. The cost in the second design and the expectation of the total event cost are defined in Table 3 and by formula (6), respectively.
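The component rows of the table above add up to its first row, which can be checked with a few lines of Python. The numbers are copied from the table, with the four value columns taken left to right; the tolerance of 0.05 is an assumption that allows for rounding to one decimal place. The sleep events have no row of their own, and the sums already match without one.

```python
import math

# Cost components from the table, one list per row, columns left to right.
components = {
    "gains":        [60.0, 358.4, 53448.0, 77040.5],
    "losses":       [38024.0, 56660.0, 98376.0, 187600.5],
    "duplications": [26796.0, 34324.6, 38286.0, 44639.6],
    "transfers":    [32563.4, 60168.3, 17887.0, 223854.8],
    "gain_big":     [0.0, 118.4, 2920.0, 2388.6],
}

# First row of the table: total cost / expectation per column.
reported_totals = [97443.4, 151629.7, 210917.0, 535524.0]


def column_sums(rows):
    """Add the component rows column by column."""
    return [sum(col) for col in zip(*rows.values())]


sums = column_sums(components)

# True when every column sum agrees with the reported total within rounding.
consistent = all(
    math.isclose(s, t, abs_tol=0.05) for s, t in zip(sums, reported_totals)
)
```

With these inputs every column sum agrees with the reported total, so the listed components appear to account for the whole cost in all four columns.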