Abstract
The unprecedented amount of data resulting from next-generation sequencing has opened a new era in phylogenetic estimation. Although large datasets should, in theory, increase phylogenetic resolution, massive, multilocus datasets have uncovered a great deal of phylogenetic incongruence among different genomic regions, due both to stochastic error and to the action of different evolutionary processes such as incomplete lineage sorting, gene duplication and loss, and horizontal gene transfer. This incongruence violates one of the fundamental assumptions of the DNA barcoding approach, which assumes that gene history and species history are identical. In this review, we explain some of the most important challenges we will have to face to reconstruct the history of species, and the advantages and disadvantages of different strategies for the phylogenetic analysis of multilocus data. In particular, we describe the evolutionary events that can generate species tree-gene tree discordance, compare the most popular methods for species tree reconstruction, highlight the challenges we need to face when using them and discuss their potential utility in barcoding. Current barcoding methods sacrifice a great amount of statistical power by only considering one locus, and a transition to multilocus barcodes would not only improve current barcoding methods, but also facilitate an eventual transition to species-tree-based barcoding strategies, which could better accommodate scenarios where the barcode gap is too small or non-existent. This article is part of the themed issue 'From DNA barcodes to biomes'.
Keywords: barcode gap; incomplete lineage sorting; multilocus barcoding; multispecies coalescent; phylogenetic incongruence; species tree reconstruction
Year: 2016 PMID: 27481787 PMCID: PMC4971187 DOI: 10.1098/rstb.2015.0335
Source DB: PubMed Journal: Philos Trans R Soc Lond B Biol Sci ISSN: 0962-8436 Impact factor: 6.237
Acronym table.
| acronym | meaning |
|---|---|
| AFLP | amplified fragment length polymorphism |
| GDL | gene duplication and loss |
| GTP | gene tree parsimony |
| HGT | horizontal gene transfer |
| ILS | incomplete lineage sorting |
| MSC | multispecies coalescent |
| SNP | single nucleotide polymorphism |
Figure 1. Evolutionary processes that generate species tree/gene tree incongruence. The figure shows the species tree (grey tree in the background) and a gene tree (black tree) tracking the evolutionary history of six species (A, B, C, D, E and F) and nine gene copies (A0α, A0β, B0, C0, C1, D0, E0, E1 and F0) in eight individuals (A0, B0, C0, C1, D0, E0, E1, F0). Each evolutionary process is indicated by a label and a specific symbol at the node where it is mapped (duplication, square; loss, cross; transfer, arrow; deep coalescence, circle; hybridization, pentagon; gene flow, ellipse). Dashed lines indicate superfluous lineages that do not reach the present due to gene loss.
Figure 2. Multispecies coalescent model. The figure shows the species tree (grey tree in the background) and a gene tree (black tree) tracking the evolutionary history of five species (A, B, C, D and E) and several individuals per species. Each species tree branch corresponds to an independent coalescent process. Gene tree nodes are depicted with circles, where open circles indicate deep coalescences. The confounding effect of ILS on standard barcoding techniques is reflected here, for example between species A and B: individual B0 from species B clusters with individuals A2 and A3 from species A, which illustrates the absence of a barcode gap.
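To make the link between branch length and deep coalescence concrete, the following is a minimal sketch under the standard multispecies coalescent for three species, not code from the review. It assumes the internal species tree branch length `t` is measured in coalescent units, and uses the textbook results that two lineages fail to coalesce within that branch with probability exp(-t) and that, when they fail, each of the three possible gene tree topologies is equally likely. All function names are hypothetical.

```python
import math
import random


def prob_deep_coalescence(t):
    """Probability that two lineages do NOT coalesce within a branch
    of length t (in coalescent units; an assumption of this sketch)."""
    return math.exp(-t)


def prob_discordant_triplet(t):
    """Probability that a three-species gene tree disagrees with the
    species tree: deep coalescence, then 2 of 3 topologies are discordant."""
    return (2.0 / 3.0) * prob_deep_coalescence(t)


def simulate_discordance(t, n_reps=100_000, seed=1):
    """Monte Carlo check of prob_discordant_triplet."""
    rng = random.Random(seed)
    discordant = 0
    for _ in range(n_reps):
        # Waiting time for two lineages to coalesce is Exponential(rate 1).
        if rng.expovariate(1.0) > t:
            # Deep coalescence: three lineages enter the ancestral
            # population and a random pair (1 of 3) coalesces first.
            # Pair 0 stands for the pair that matches the species tree.
            if rng.randrange(3) != 0:
                discordant += 1
    return discordant / n_reps


if __name__ == "__main__":
    for t in (0.1, 0.5, 1.0, 3.0):
        print(t, round(prob_discordant_triplet(t), 4), simulate_discordance(t))
```

Short internal branches give discordance probabilities approaching 2/3, which is the regime in which a single-locus barcode is most likely to group an individual with the wrong species.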
Species tree reconstruction programs. For each program (the list is not exhaustive), the table indicates the evolutionary processes that generate species tree/gene tree discordance explicitly taken into consideration by the model (Ev. process; ILS, incomplete lineage sorting; GDL, gene duplication and loss; HGT, horizontal gene transfer), input data (MSAs, multiple sequence alignments; SNPs, single nucleotide polymorphisms), output data and the amount of data each program is intended to handle (scalability).
| program | Ev. process | strategy | input | output | scalability |
|---|---|---|---|---|---|
| ASTRAL I/II | none | algorithm: quartet compatibility | unrooted trees | unrooted supertree | genome-wide |
| BUCKy | none/ILS | Bayesian inference: concordance factors | unrooted distributions | unrooted species tree and gene tree distributions | multilocus |
| RF supertrees | none | heuristic: distance | rooted trees | rooted supertree | genome-wide |
| MulRF | none | heuristic: distance | unrooted trees | unrooted supertree | genome-wide |
| iGTP | ILS or GDL | heuristic: reconciliation cost | rooted or unrooted trees | rooted or unrooted supertrees | genome-wide |
| SPRSupertrees | HGT | heuristic: distance | unrooted trees | unrooted or rooted supertrees | genome-wide |
| GLASS, SD, MAC, STEAC, STAR | ILS | algorithm: distance | rooted trees | rooted species tree | genome-wide |
| NJst/ASTRID | ILS | algorithm: distance | unrooted trees | unrooted species tree | genome-wide |
| STEM | ILS | algorithm: distance + likelihood | rooted trees, theta, rate | rooted species tree | genome-wide |
| MP-EST | ILS | heuristic: pseudo-likelihood | rooted trees | rooted species tree | genome-wide |
| STELLS | ILS | heuristic: pseudo-likelihood | rooted trees | rooted species tree | genome-wide |
| Guenomu | ILS + GDL + HGT + distances | Bayesian inference: distance-based model | unrooted tree distributions | rooted species tree and unrooted gene tree distributions | genome-wide |
| PHYLDOG | GDL | heuristic: likelihood | MSAs | rooted supertree | genome-wide |
| BEST | ILS | Bayesian inference: MSC | MSAs | rooted species and gene tree distributions | small multilocus datasets |
| *BEAST | ILS | Bayesian inference: MSC | MSAs | rooted species and gene tree distributions | small multilocus datasets |
| SVDQuartets | ILS | algorithm: singular value decomposition of site pattern frequency matrix + quartet tree reconstruction | SNPs | unrooted species tree | genome-wide |
| Phylonet | ILS + hybridization | heuristic (multiple): reconciliation cost/pseudo-likelihood/likelihood | unrooted gene trees | unrooted species networks | multilocus datasets |
| SNAPP | ILS | Bayesian inference: MSC | SNPs | rooted species tree distribution | genome-wide |
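
Several of the supertree strategies listed above are described as distance heuristics over input trees. As an illustration of what such a tree-to-tree distance can look like, the sketch below computes a Robinson-Foulds-style distance between two rooted trees as the number of clades found in one tree but not the other. This is an assumption chosen for illustration, not the implementation of any program in the table; the nested-tuple tree encoding and the function names are hypothetical.

```python
def clusters(tree):
    """Return the non-trivial clades of a rooted tree.

    A tree is a nested tuple whose leaves are strings (hypothetical
    encoding for this sketch), e.g. (("A", "B"), "C").
    Each clade is a frozenset of leaf names.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade is shared by every tree
    return found


def rf_distance(tree_a, tree_b):
    """Count clades present in exactly one of the two rooted trees."""
    return len(clusters(tree_a) ^ clusters(tree_b))


if __name__ == "__main__":
    species_tree = (("A", "B"), "C")
    gene_tree = (("A", "C"), "B")
    print(rf_distance(species_tree, gene_tree))
```

A distance-based supertree search would look for the tree over all species that minimizes the summed distance to the input gene trees; that search step is not shown here.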