Gustavo A Bravo, Alexandre Antonelli, Christine D Bacon, Krzysztof Bartoszek, Mozes P K Blom, Stella Huynh, Graham Jones, L Lacey Knowles, Sangeet Lamichhaney, Thomas Marcussen, Hélène Morlon, Luay K Nakhleh, Bengt Oxelman, Bernard Pfeil, Alexander Schliep, Niklas Wahlberg, Fernanda P Werneck, John Wiedenhoeft, Sandi Willows-Munro, Scott V Edwards.
Abstract
Building the Tree of Life (ToL) is a major challenge of modern biology, requiring advances in cyberinfrastructure, data collection, theory, and more. Here, we argue that phylogenomics stands to benefit by embracing the many heterogeneous genomic signals emerging from the first decade of large-scale phylogenetic analysis spawned by high-throughput sequencing (HTS). Such signals include those most commonly encountered in phylogenomic datasets, such as incomplete lineage sorting, but also those reticulate processes emerging with greater frequency, such as recombination and introgression. We focus specifically on how phylogenetic methods can accommodate the heterogeneity incurred by such population genetic processes; we do not discuss phylogenetic methods that ignore such processes, such as concatenation or supermatrix approaches or supertrees. We suggest that methods of data acquisition and the types of markers used in phylogenomics will remain restricted until a posteriori methods of marker choice are made possible with routine whole-genome sequencing of taxa of interest. We discuss limitations and potential extensions of a model supporting innovation in phylogenomics today, the multispecies coalescent model (MSC). Macroevolutionary models that use phylogenies, such as character mapping, often ignore the heterogeneity on which phylogeny building increasingly relies; we suggest that assimilating such heterogeneity is an important goal moving forward. Finally, we argue that an integrative cyberinfrastructure linking all steps of the process of building the ToL, from specimen acquisition in the field to publication and tracking of phylogenomic data, as well as a culture that values contributors at each step, are essential for progress.
Keywords: Gene flow; Genome; Multispecies coalescent model; Retroelement; Speciation; Transcriptome
Year: 2019 PMID: 30783571 PMCID: PMC6378093 DOI: 10.7717/peerj.6399
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1. A posteriori marker selection from whole-genome alignments for phylogenomics and phylogeography.
Whole-genome analysis (A) permits researchers to choose different markers for specific purposes (B–D). By contrast, subsampling methods such as RAD-seq or hybrid capture, which dominate phylogenomics today, usually yield a specific set of markers that the researcher has chosen a priori. Generating whole-genome alignments (WGA) thus greatly increases the use of genomic data in biological research, beyond the initial goals of the researcher producing those data. Here, we show how a hypothetical WGA that includes seven different loci (different colors) for four individuals allows extracting sequence data to generate gene trees (B), identifying SNPs to genotype individuals (C), and measuring copy depth to infer CNVs across genomic regions (D). Ultimately, these different kinds of data can be translated into species tree inferences (B–D). In the case of CNVs, only locus number 3 (orange) shows significant CNV. Because CNVs are measured as continuous characters (i.e., copy depth), the orange shading represents a hypothetical evolutionary scenario of copy number variation of genomic region number 3 within the inferred species tree, which is incongruent with those based on sequence and SNP data from other loci in the genome.
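
The idea of choosing markers after the alignment exists can be illustrated with a minimal Python sketch. It is not taken from the article: the toy alignment, the function names `extract_locus` and `variable_sites`, and the rule of skipping columns with gaps or `N` are all assumptions made for illustration.

```python
from typing import Dict, List


def extract_locus(alignment: Dict[str, str], start: int, end: int) -> Dict[str, str]:
    """Slice the same column range [start, end) out of every aligned sequence."""
    return {name: seq[start:end] for name, seq in alignment.items()}


def variable_sites(alignment: Dict[str, str]) -> List[int]:
    """Return 0-based column indices at which the aligned sequences differ.

    Columns containing a gap ('-') or an ambiguous base ('N') are skipped
    (an assumption of this sketch, not a rule from the article).
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    sites = []
    for i in range(length):
        column = {s[i].upper() for s in seqs}
        if column & {"-", "N"}:
            continue
        if len(column) > 1:
            sites.append(i)
    return sites


# Hypothetical whole-genome alignment for four individuals.
toy_wga = {
    "ind1": "ACGTACGTAC",
    "ind2": "ACGTACGTAC",
    "ind3": "ACGAACGTAC",
    "ind4": "ACGAACG-AC",
}

locus = extract_locus(toy_wga, 0, 4)   # sequence data for one locus (gene-tree input)
snps = variable_sites(toy_wga)         # candidate SNP columns for genotyping
print(locus["ind3"], snps)
```

The same alignment serves both purposes here, which is the point of the figure: the marker type is decided after the data are generated, not before.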
Figure 2. Trends in phylogenomic data sets since the emergence of HTS.
Based on a sample of 164 phylogenomic papers published since 2004 (see Table S1), we observed no increase in the number of species per data set over time (A). On the other hand, there is a significant increase in the number of loci (B), total alignment length (C), and total data set size, as measured by the product of species times locus number (Data set size 1, E) and species times total alignment length (Data set size 2, F). Moreover, data sets published since the advent of HTS do not support the notion of a tradeoff between the number of species and the number of loci in phylogenomic studies (D).
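
The two data set size measures in the caption are simple products. The following Python sketch spells them out; the class name `Study` and the example numbers are hypothetical and do not come from Table S1.

```python
from dataclasses import dataclass


@dataclass
class Study:
    species: int           # number of species in the data set
    loci: int              # number of loci
    alignment_length: int  # total alignment length in base pairs

    def size_by_loci(self) -> int:
        """Data set size 1: species times number of loci."""
        return self.species * self.loci

    def size_by_length(self) -> int:
        """Data set size 2: species times total alignment length."""
        return self.species * self.alignment_length


# Hypothetical study, for illustration only.
example = Study(species=50, loci=1000, alignment_length=500_000)
print(example.size_by_loci(), example.size_by_length())
```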
Definitions of core concepts used in this article.
| Concept | Definition |
|---|---|
| The Tree of Life (ToL) | This idea, originally articulated by Darwin and others, refers to the grand vision of understanding the branching pattern of all life on earth. Today the idea conveys the use of morphological and molecular data to reconstruct the phylogenetic relationships of all life forms. In some usages, the idea also includes reconstructing reticulate evolutionary events, such as introgression and hybridization, which are now thought to be common in many lineages. |
| High-throughput sequencing (HTS) | Also referred to as “next generation sequencing”, this term refers to the plethora of new DNA and RNA sequencing technologies that in the last fifteen years have allowed biologists to dramatically increase the number of bases sequenced for a given species or clade. HTS technologies can be applied to sequencing whole genomes or transcriptomes and have been embraced by phylogeneticists interested in increasing the size of comparative molecular data sets. See |
| The multispecies coalescent model (MSC) | A generalization of the standard, single population coalescent model to multiple species related in a phylogeny. The MSC applies the single-population coalescent model to each branch of a phylogenetic tree, including both terminal and internal branches. In the MSC, alleles sampled in terminal species will coalesce to a smaller number of ancestral alleles at a rate depending on the effective population size within the branch. The gene tree lineages in a branch of the species tree do not necessarily coalesce within that branch as one goes backwards in time; multiple alleles may persist into ancestral branches. This phenomenon is called incomplete lineage sorting (see next definition). The decrease in the number of alleles and the time to coalescence to a single allele in a lineage follow the standard neutral coalescent model, until all alleles coalesce from all species. See |
| Incomplete lineage sorting (ILS) | This phenomenon, originally described by John Avise (see |
| Gene duplication and loss (GDL) | This concept describes the process by which a gene in an ancestral species can duplicate, forming paralogs, and one or more of the paralogs can subsequently be deleted from the genome, resulting in complex patterns of relationships among paralogs and orthologs. Gene duplication is another mechanism, in addition to ILS, that can render the gene tree different from the species tree. As a result of gene (paralog) loss, inferring the correct ortholog/paralog relationships and history of branching events in a multigene family can be challenging. Phylogenetic models incorporating GDL try to use patterns in multigene families to deduce the branching history of the constituent species. See |
| Ancestral recombination graph (ARG) | This is a complete record of the coalescent and recombination events in the history of a set of DNA sequences. As a consequence of incorporating recombination events, ARGs do not necessarily depict trees, but often have a network structure. The accurate estimation of ARGs remains challenging but they enhance our ability to estimate recombination rates, ancestral effective population sizes, population divergence times, rates of gene flow between populations, and detect selective sweeps. See |
| Lateral gene transfer (LGT) | This process occurs when genes jump taxonomic and phylogenetic boundaries, moving between unrelated species and therefore causing discordances between genetic and lineage history. LGT, along with ILS and GDL, was among the three primary causes of discordance between gene and species trees identified by |
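
The MSC and ILS definitions above can be made concrete for the simplest case, a rooted species tree ((A,B),C) with one allele sampled per species. This minimal Python sketch is not code from the article. It relies on the standard coalescent result that two lineages fail to coalesce along an internal branch of length T (in coalescent units) with probability exp(-T). The function names and the choice to measure T in coalescent units are assumptions of the sketch.

```python
import math
import random
from typing import Dict


def triplet_topology_probs(T: float) -> Dict[str, float]:
    """Gene tree topology probabilities for species tree ((A,B),C).

    T is the internal branch length in coalescent units. With probability
    exp(-T) the A and B lineages do not coalesce in the internal branch (ILS),
    and the three possible topologies then become equally likely.
    """
    p_no_coal = math.exp(-T)
    return {
        "concordant": 1.0 - (2.0 / 3.0) * p_no_coal,
        "discordant_1": p_no_coal / 3.0,
        "discordant_2": p_no_coal / 3.0,
    }


def simulate_triplet(T: float, n_loci: int, seed: int = 0) -> Dict[str, int]:
    """Simulate gene tree topologies at n_loci independent loci under the MSC."""
    rng = random.Random(seed)
    counts = {"concordant": 0, "discordant_1": 0, "discordant_2": 0}
    for _ in range(n_loci):
        # Waiting time for the A and B lineages to coalesce: exponential, rate 1.
        if rng.expovariate(1.0) < T:
            counts["concordant"] += 1
        else:
            # All three lineages reach the root population; each pair is
            # equally likely to coalesce first.
            counts[rng.choice(list(counts))] += 1
    return counts


probs = triplet_topology_probs(1.0)
counts = simulate_triplet(1.0, n_loci=20000, seed=1)
print(probs)
print(counts)
```

Short internal branches (small T) push the concordant probability toward 1/3, which is why ILS is most pronounced in rapid radiations or in lineages with large effective population sizes.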
Figure 3. Some examples of violations of the multispecies coalescent.
In event A, there is gene flow; in event B there is homoploid hybridization; in event C, there is a gene duplication; and in event D, incomplete lineage sorting. All of these processes contribute to gene tree heterogeneity but fall outside the standard multispecies coalescent model. Importantly, all of these processes also yield strictly dichotomous gene trees, whereas recombination (not illustrated here) does not.
Figure 5. Complex patterns of gene lineages with polyploidization and interspecific gene flow.
Genes and genomes of four species A, B, C and D. Multi-colored bars show (parts of) genomes with a number of loci indicated in different colors. Two gene trees, one orange and one blue, evolve within the species network. Species B is an allopolyploid containing two genomes.
Figure 6. Gradual speciation, or isolation-with-migration.
After species start to split, gene flow between them decreases gradually. Such a gradual decrease in the extent of gene flow between species might present an especially useful extension of the standard multispecies coalescent model. Colors depict different gene pools and their gradual change along branches describes how species gradually differentiate despite the existence of migration over time. Thickness and color intensity of arrows show that gene flow becomes weaker as species gradually isolate.
Figure 7. Two possible species phylogenies producing similar observations at present time.
(A) Species tree with gene flow. (B) Species network with homoploid hybridization. Distinguishing two such scenarios usually requires simulations and comparison of observed and expected summary statistics.
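
One simple instance of comparing observed and expected summary statistics is sketched below in Python. It is an illustration chosen for this sketch, not the authors' method. Under the MSC with incomplete lineage sorting alone, the two discordant gene tree topologies of a rooted triplet are expected at equal frequency, so a strong asymmetry in their counts points to gene flow or hybridization. The function name and the example counts are hypothetical. The test cannot by itself separate scenario (A) from scenario (B); that still requires simulation under each model.

```python
from math import comb


def minor_topology_symmetry_pvalue(n1: int, n2: int) -> float:
    """Two-sided exact binomial test that two discordant topology counts are equal.

    n1 and n2 are the numbers of loci supporting each of the two discordant
    triplet topologies. Under ILS alone each discordant locus falls in either
    class with probability 0.5.
    """
    n = n1 + n2
    if n == 0:
        return 1.0
    k = min(n1, n2)
    # Binomial(n, 0.5) is symmetric, so the two-sided p-value is twice the
    # lower tail P(X <= k), capped at 1.
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2.0 * lower_tail)


# Hypothetical counts of the two discordant topologies across loci.
print(minor_topology_symmetry_pvalue(48, 52))   # roughly symmetric counts
print(minor_topology_symmetry_pvalue(20, 80))   # strongly asymmetric counts
```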
Figure 4. Gene duplication and loss (GDL) creates patterns that can mimic incomplete lineage sorting and other processes, leading to spurious inferences of the species history.
Genes and genomes of three species A, B, and C. Multi-colored bars show (parts of) their genomes with a number of loci indicated in different colors. The orange gene is duplicated in species A and it was lost in species B. The blue gene was duplicated before the divergence between species A and the ancestor of species B and C. However, one of these copies was lost in species A, whereas both copies were maintained in species B and C. Reconstruction of the orange gene tree based on extant diversity will yield a wrong inference of its history due to the absence of data for species B. On the other hand, the outcome of a phylogenetic reconstruction of the blue gene is difficult to predict. Depending on which of the duplicates are sampled for species B and C, different outcomes can be expected regarding the relationship among the three species. The duplication and loss history of these two genes may cause serious issues for phylogenetic reconstruction because no specific pattern can be expected between them.
Challenges in the fields of coalescent-based phylogenomics and implications for unraveling character evolution and the Tree of Life.
| Category | Challenge | Proposed strategy |
|---|---|---|
| Data | Integration and assessment of large amounts of data with heterogeneous phylogenetic signal. | Protocols for marker selection should assess markers’ biological relevance and adequacy for the study organism, given the temporal and spatial scales in question, and not only logistical convenience. A posteriori (after data generation) marker selection from whole-genome alignments can be useful to inform these aspects as well as minimize the effects of missing data and varying data quality. Until then, researchers should attempt a higher standardization of markers to facilitate combinatory analyses. |
| | | To discern true phylogenomic heterogeneity from noise and error as well as to identify violations of the MSC, adequate filtering of large phylogenomic datasets should be conducted based on biological and statistical properties of markers (e.g., analyses of gene-tree outliers and rogue taxa). |
| | | Further research on filtering methods as well as on their impact on phylogenomic estimation is still required. |
| | Inclusion of additional character types into phylogenomic analyses. | Research efforts focused on the adequate identification and utilization of rare genomic changes other than nucleotide substitutions, such as indels, transpositions, inversions, CNVs, and chromosomal rearrangements. Development of new methods not only to infer phylogenetic hypotheses based on these characters but also to integrate them with more traditional sequence data. |
| Phylogenetic inference models and methods | Analyses of genome-scale data for large numbers of species within a coalescent framework. | Continue the development of models and methods that allow simultaneous gene tree and species tree estimation within a Bayesian framework (e.g., |
| | | For the time being, two-step methods, particularly those based on biological models and permitting statistical tests of topologies in a likelihood framework (e.g., |
| | Detection and incorporation of violations of the MSC into phylogenomic inferences. | Extensions of the MSC should seek the inference of reticulate evolutionary histories (i.e., multispecies network coalescent; |
| | | Further development of conceptual approaches aimed at detecting and quantifying different underlying biological processes of phylogenetic history (e.g., |
| Models that integrate phylogenomics and comparative analyses | Integrating different phylogenetic signals into comparative analyses. | Methods and models should attempt to incorporate gene tree incongruence into macroevolutionary models of character evolution. Similarly, integrative studies aiming at unraveling character evolution at the molecular, phenotypic, and clade levels. |
| | Understanding the genomic bases of character evolution in species trees vs. gene trees. | Methods that estimate phenotype-genotype associations incorporating heterogeneity across gene trees or that at least take into account differential state probabilities stemming from gene tree discordance (e.g., |
| Best practices for building the ToL | Increasing the number of species represented in the ToL while ensuring reproducibility and encouraging community participation | Natural history museums must be central players for providing and analyzing genome-scale data. Genetic resources and specimen collections are fundamental for allowing the acquisition of data for extinct and poorly accessible species. Open access community initiatives must continue to be relevant repositories of the ToL. Adequate methods for curation of data and analytical tools must continue to be a high priority. |