Alexander Donath, Peter F Stadler.
Abstract
BACKGROUND: Most phylogenetic studies using molecular data treat gaps in multiple sequence alignments as missing data or even completely exclude alignment columns that contain gaps.
Keywords: Genome-wide multiple sequence alignments; In/del; Phylogenomics; Splits
Year: 2018 PMID: 30026791 PMCID: PMC6047143 DOI: 10.1186/s13015-018-0130-7
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Fig. 1 Non-trivial example of the determination of splids with size ≥ 2 from two concatenated alignments (A and B). Alignment A contains sequence data for all taxa, whereas B lacks sequence information for taxon g. First, all indel loci are determined (I–IV). Second, the indel loci are searched for indels that constitute splids. From locus I, only indels (4) and (6) fulfill this criterion. Indels (1) and (3) do not share a common 5′ end. Indel (8) is too short. Indels (9) and (10) of locus III are overlapping splids. Whether indel (11) is included in the final splid set depends on the applied algorithm. In strict mode it is not included because of the single-residue indel (13). In fuzzy mode it is included, and taxon g is marked as missing data (“?”) in the binary presence/absence coding
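The core of the procedure in Fig. 1 can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the function names `find_gaps` and `candidate_splids` are hypothetical, and the sketch assumes a single alignment, a minimum gap length of 2, identical start and end coordinates for a shared indel, and at least two taxa on each side of the induced split. It does not handle missing taxa, gaps at the alignment ends, overlapping indels within a locus, or the strict and fuzzy modes.

```python
from collections import defaultdict


def find_gaps(seq, gap="-"):
    """Return half-open (start, end) intervals of maximal gap runs in seq."""
    gaps, start = [], None
    for i, ch in enumerate(seq):
        if ch == gap:
            if start is None:
                start = i
        elif start is not None:
            gaps.append((start, i))
            start = None
    if start is not None:
        gaps.append((start, len(seq)))
    return gaps


def candidate_splids(alignment, min_len=2):
    """Binary-code gaps that share both ends across taxa.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    Returns the sorted taxon list and a dict (start, end) -> presence string,
    where '1' means the taxon carries the gap and '0' means it does not.
    """
    taxa = sorted(alignment)
    carriers = defaultdict(set)
    for taxon in taxa:
        for start, end in find_gaps(alignment[taxon]):
            if end - start >= min_len:  # skip gaps that are too short
                carriers[(start, end)].add(taxon)
    coded = {}
    for interval, with_gap in sorted(carriers.items()):
        # Assumption: keep only gaps that separate >= 2 taxa from >= 2 others.
        if 2 <= len(with_gap) <= len(taxa) - 2:
            coded[interval] = "".join("1" if t in with_gap else "0" for t in taxa)
    return taxa, coded


# Toy alignment (made up for illustration, not taken from the paper).
example = {
    "a": "ACGT--ACGTA",
    "b": "ACGT--ACGTA",
    "c": "ACGTTTACGTA",
    "d": "ACGTTTAC-TA",
    "e": "ACGTTTACGTA",
}
taxa, coded = candidate_splids(example)
print(taxa, coded)
```

In the toy alignment, the two-column gap shared by taxa a and b is kept, whereas the single-residue gap in taxon d is discarded as too short, mirroring the fate of indel (8) in the figure.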
Overview of the total number of alignment sites per alignment method and the number of derived splids with length ≥ 2 bp for the ENCODE data set containing only alignments with sequence information for all taxa
| Program | Number of sites | Number of splids |
|---|---|---|
| ClustalW | 79,006 | 793 |
| Dialign-TX | 96,990 | 2163 |
| Mafft | 84,105 | 1021 |
| Mafft L-INS-i | 83,578 | 1245 |
| Mafft G-INS-i | 83,123 | 1279 |
| Muscle | 84,577 | 1378 |
| ProbConsRNA | 86,277 | 1927 |
| Prank | 96,622 | 2047 |
| T-Coffee | 84,835 | 1831 |
| TBA/Multiz | 90,726 | 2032 |
Fig. 2 Number of splids with a length of ≥ 2 bp that have been extracted from the alignments of the ENCODE data set containing sequence information for all taxa
Detailed comparison of the differences between the ENCODE guide tree and the best maximum likelihood trees calculated from splid data derived from various alignment tools
| | ClustalW | Dialign-TX | Mafft | Mafft G-INS-i | Mafft L-INS-i | Muscle | Prank | ProbConsRNA | T-Coffee | TBA/Multiz |
|---|---|---|---|---|---|---|---|---|---|---|
| Afrotheria | × | × | – | × | × | × | × | × | × | × |
| Sister group to Boreoeutheria | Sister group to Boreoeutheria | Sister group to Boreoeutheria | Sister group to Boreoeutheria | Within Boreoeutheria | Sister group to Xenarthra | Sister group to (Laurasiatheria, Xenarthra) | Sister group to (Laurasiatheria, Xenarthra) | Sister group to Euarchontoglires | ||
| ((elephant, rock hyrax), tenrec) | × | × | – | × | × | × | × | × | × | × |
| Xenarthra (Armadillo) | Sister taxon to Epitheria | Sister taxon to Epitheria | Sister taxon to part of the paraphyletic Epitheria | Sister taxon to Epitheria | Sister taxon to Epitheria | Sister taxon to Epitheria | Sister taxon to Afrotheria | Sister taxon to Laurasiatheria | Sister taxon to Laurasiatheria | Sister group to Epitheria |
| Boreoeutheria | × | × | × | × | × | – | – | – | – | – |
| Laurasiatheria | × | × | × | × | × | × | × | × | × | × |
| Insectivora | × | × | × | × | × | × | × | × | × | × |
| Chiroptera | × | × | × | × | × | × | × | × | × | × |
| ((rfbat, flying fox), sbbat) | × | × | × | × | × | × | × | – | × | – |
| Carnivora | × | × | × | × | × | × | × | × | × | × |
| Horse | (bats, horse) | (Carn., horse) | (Carn., horse) | (Carn., horse) | (cow, horse) | ((bats, cow), horse) | ((bats, cow), horse) | (((bats, cow), Carn.), horse) | (cow, horse) | (Carn., horse) |
| Cow | ((bats, horse), cow) | (((Carn., horse), bats), cow) | (((Carn., horse), bats), cow) | (bats, cow) | (cow, horse) | (bats, cow) | (bats, cow) | (bats, cow) | (cow, horse) | (((Carn., horse), bats), cow) |
| Euarchontoglires | × | × | × | × | × | – | × | × | – | × |
| Glires | × | – | – | – | – | – | × | × | – | × |
| Rodentia | – | – | – | × | × | – | × | × | – | × |
| Muroidea | × | × | × | × | × | × | × | × | × | × |
| Rabbit | Sister taxon to Muroidea | Sister taxon to tree shrew; both sister group to Primata | Basal within Euarchontoglires | Sister taxon to tree shrew; both sister group to Rodentia | Sister taxon to tree shrew; both sister group to Rodentia | Sister taxon to Euarchontoglires | Sister taxon to Rodentia | Sister taxon to Rodentia | Sister taxon to Primata | Sister taxon to Rodentia |
| Primata | × | × | × | × | × | × | × | – | × | × |
| Strepsirrhini | × | × | × | × | × | × | × | × | × | × |
| Platyrrhini | × | × | × | × | × | × | × | × | × | × |
| (((squirrel monkey, marmoset), owl monkey), dusky titi) | – | – | – | – | – | – | – | – | × | – |
| Catarrhini | × | × | × | × | × | × | × | × | × | × |
| Cercopithecidae | × | × | × | × | × | × | × | × | × | × |
| (((baboon, macaque), vervet), colobus) | – | – | – | × | × | × | × | – | × | – |
| Hominoidea | × | × | × | × | × | × | × | × | × | × |
| (((chimp, human), orangutan), gibbon) | × | × | × | × | × | × | – | – | × | – |
| Tree shrew | In Glires; sister taxon to (Hystricomorpha, Sciuromorpha) | Sister taxon to rabbit; both sister group to Primata | Sister taxon to Rodentia | Sister taxon to rabbit; both sister group to Rodentia | Sister taxon to rabbit; both sister group to Rodentia | Sister taxon to (Hystricomorpha, Sciuromorpha) | Basal within Euarchontoglires | Sister taxon to Strepsirrhini | Sister taxon to remaining (Epitheria, Xenarthra) | Sister taxon to Glires |
| RF distance | 20 | 18 | 18 | 14 | 16 | 22 | 16 | 20 | 20 | 14 |
| Normalized RF distance | 0.3030 | 0.2727 | 0.2727 | 0.2121 | 0.2424 | 0.3333 | 0.2424 | 0.3030 | 0.3030 | 0.2121 |
| Quartet distance | 2347 | 2980 | 2748 | 1892 | 2043 | 6376 | 4164 | 6951 | 9458 | 3932 |
| Normalized quartet distance | 0.0398 | 0.0506 | 0.0467 | 0.0321 | 0.0347 | 0.1082 | 0.0707 | 0.1180 | 0.1606 | 0.0668 |
Splids (gap length ≥ 2 bp) were extracted from the ENCODE regions containing sequence information for all taxa. For each tree, the symmetric difference (Robinson–Foulds distance, RF), the normalized RF distance, the quartet distance (at most 58,905), and the normalized quartet distance to the ENCODE guide tree are shown. rfbat = Rhinolophus ferrumequinum, sbbat = Myotis lucifugus, Carn. = Carnivora. “×” = monophyly/position recovered, “–” = monophyly/position not recovered. See text for details
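The normalized distances in the table can be reproduced from the raw ones. The sketch below assumes that the RF distance is normalized by 2(n − 3), the maximum for two unrooted binary trees on n taxa, and the quartet distance by the number of quartets, C(n, 4). The value n = 36 is not stated in this excerpt; it is inferred from the maximum quartet distance of 58,905 = C(36, 4). The function names are hypothetical.

```python
from math import comb


def normalized_rf(rf, n_taxa):
    """RF distance divided by its maximum, 2 * (n - 3), for unrooted binary trees."""
    return rf / (2 * (n_taxa - 3))


def normalized_quartet(dq, n_taxa):
    """Quartet distance divided by the number of quartets, C(n, 4)."""
    return dq / comb(n_taxa, 4)


# Assumption: 36 taxa, inferred from the stated maximum quartet distance.
N_TAXA = 36
max_quartets = comb(N_TAXA, 4)

# Mafft G-INS-i column of the table: RF distance 14, quartet distance 1892.
rf_norm = round(normalized_rf(14, N_TAXA), 4)
dq_norm = round(normalized_quartet(1892, N_TAXA), 4)
print(max_quartets, rf_norm, dq_norm)
```

Under these assumptions the script yields 58,905 quartets and the normalized values 0.2121 and 0.0321 listed for Mafft G-INS-i.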
Fig. 3 Cladogram with bootstrap values obtained from 100 bootstrap trees calculated by RAxML using splid data and the Gamma model with ascertainment bias correction. Splids with gap lengths ≥ 2 bp were extracted from the small ENCODE data set that had been re-aligned using Mafft G-INS-i
Results for the large ENCODE data set. Splids ≥ 2 bp were coded, and trees were calculated with RAxML using the Gamma model for binary data and ascertainment bias correction
| | Mafft G-INS-i | T-Coffee | TBA/Multiz |
|---|---|---|---|
| Number of sites | 36,132,992 | 36,450,667 | 37,689,662 |
| Number of splids | 545,790 | 922,277 | 919,908 |
| RF distance | 16 | 16 | 12 |
| Normalized RF distance | 0.2424 | 0.2424 | 0.1818 |
| Quartet distance | 5000 | 7494 | 3710 |
| Normalized quartet distance | 0.0849 | 0.1272 | 0.0630 |