Cheong Xin Chan, Mark A Ragan.
Abstract
Thanks to advances in next-generation technologies, genome sequences are now being generated at breadth (e.g. across environments) and depth (thousands of closely related strains, individuals or samples) unimaginable only a few years ago. Phylogenomics, the study of evolutionary relationships based on comparative analysis of genome-scale data, has so far been developed as industrial-scale molecular phylogenetics, proceeding in the two classical steps: multiple alignment of homologous sequences, followed by inference of a tree (or multiple trees). However, the algorithms typically employed for these steps scale poorly with the number of sequences, such that for an increasing number of problems, high-quality phylogenomic analysis is (or soon will be) computationally infeasible. Moreover, next-generation data are often incomplete and error-prone, and analysis may be further complicated by genome rearrangement, gene fusion and deletion, lateral genetic transfer, and transcript variation. Here we argue that next-generation data require next-generation phylogenomics, including so-called alignment-free approaches.
Year: 2013 PMID: 23339707 PMCID: PMC3564786 DOI: 10.1186/1745-6150-8-3
Source DB: PubMed Journal: Biol Direct ISSN: 1745-6150 Impact factor: 4.540
Figure 1. Simplified workflow of phylogenomic approaches. Workflow is shown for (A) the classical approach based on multiple sequence alignment, and (B) an alternative approach based on alignment-free methods, for a simple analysis example of homologous sequences 1, 2, 3 and 4, with a known phylogeny as a reference (shown on top). Sequence fragments that share the same ancestry across all four sequences (i.e. are highly similar to one another) are shown in the same colour (red, blue, yellow and orange regions in each sequence). In this example, the yellow and blue regions of sequences 2 and 4 have undergone rearrangement relative to 1 and 3. The dark yellow (in 1 and 2) and light yellow (in 3 and 4) regions are similar to each other. While the classical approach based on multiple sequence alignment (gaps introduced as dashed lines) yields an inaccurate phylogeny, the alternative alignment-free approach (grouping of sub-sequences) is not affected by the sequence rearrangement in 2 and 4, and yields the correct phylogeny. The difference between the two resulting phylogenetic trees is highlighted in red.
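The rearrangement effect described in the caption can be illustrated with a minimal Python sketch. It is not the authors' method: it assumes a Jaccard distance over k-mer sets as one simple alignment-free measure, and uses a position-by-position mismatch fraction as a crude stand-in for a comparison that assumes collinearity (it is not a real multiple sequence alignment). The function names, the four randomly generated blocks and the choice of k = 8 are all hypothetical and chosen only for illustration.

```python
import random


def kmers(seq, k):
    """Return the set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k=8):
    """1 - (shared k-mers / all k-mers); ignores where each k-mer occurs."""
    ka, kb = kmers(a, k), kmers(b, k)
    return 1.0 - len(ka & kb) / len(ka | kb)


def positional_mismatch(a, b):
    """Fraction of positions that differ when the sequences are laid side by side."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Hypothetical data: four random 200-nt blocks standing in for the
# red, blue, yellow and orange regions of the figure.
rng = random.Random(0)


def random_block(length=200):
    return "".join(rng.choices("ACGT", k=length))


red, blue, yellow, orange = (random_block() for _ in range(4))

seq1 = red + blue + yellow + orange   # reference block order
seq2 = red + yellow + blue + orange   # blue and yellow blocks swapped

kmer_d = jaccard_distance(seq1, seq2)
pos_d = positional_mismatch(seq1, seq2)
print(f"k-mer Jaccard distance: {kmer_d:.3f}")
print(f"positional mismatch:    {pos_d:.3f}")
```

Under these assumptions, only the k-mers that span a block boundary differ between the two sequences, so the k-mer distance stays close to zero, whereas the positional comparison treats the swapped blocks as almost entirely mismatched.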
Comparison of key features between phylogenomic approaches based on multiple sequence alignment and alignment-free approaches

| Multiple sequence alignment | Alignment-free |
| --- | --- |
| Assumes contiguity (with gaps) of homologous regions | Does not assume contiguity of homologous regions |
| Based on all possible pairwise comparisons of whole sequences; computationally expensive | Based on occurrences of sub-sequences; computationally inexpensive, can be memory-intensive |
| Well-established and well-studied approach in phylogenomics | Application in phylogenomics limited; requires further testing for robustness and scalability |
| More dependent on substitution/evolutionary models | Less dependent on substitution/evolutionary models |
| More sensitive to stochastic sequence variation, recombination, lateral genetic transfer, rate heterogeneity and sequences of varied lengths, especially when similarity lies in the “twilight zone” | Less sensitive to stochastic sequence variation, recombination, lateral genetic transfer, rate heterogeneity and sequences of varied lengths |
| Best practice uses inference algorithms with complexity at least | Inference algorithms typically |
| Heuristic solutions; statistical significance of how alignment scores relate to homology is difficult to assess | Exact solutions; statistical significance of the sequence distances (and degree of similarity) can be readily assessed |
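The table's point that alignment-free methods work from occurrences of sub-sequences, at low compute cost but potentially high memory cost, can be sketched in Python as follows. This is an illustrative assumption rather than a method from the article: each sequence is reduced once to a k-mer count profile, and pairwise distances are then computed between profiles (here a Euclidean distance between relative frequencies, one of many possible choices). The function names and the three toy sequences are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def kmer_counts(seq, k):
    """Count every overlapping k-mer in one pass over the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def profile_distance(ca, cb):
    """Euclidean distance between two relative k-mer frequency profiles."""
    na, nb = sum(ca.values()), sum(cb.values())
    keys = ca.keys() | cb.keys()
    # Counter returns 0 for k-mers that are absent from a profile.
    return math.sqrt(sum((ca[x] / na - cb[x] / nb) ** 2 for x in keys))


def distance_matrix(seqs, k=3):
    """Build a symmetric distance table from per-sequence k-mer profiles."""
    profiles = {name: kmer_counts(s, k) for name, s in seqs.items()}
    dist = {(name, name): 0.0 for name in seqs}
    for a, b in combinations(seqs, 2):
        dist[(a, b)] = dist[(b, a)] = profile_distance(profiles[a], profiles[b])
    return dist


# Hypothetical toy sequences: s1 and s2 differ by one base, s3 is unrelated.
seqs = {
    "s1": "ACGTACGTTGCA",
    "s2": "ACGTACGTTGCC",
    "s3": "TTTTGGGGCCCC",
}
dist = distance_matrix(seqs, k=3)
for (a, b), d in sorted(dist.items()):
    print(a, b, round(d, 3))

# Upper bound on distinct DNA k-mers per profile: this grows as 4**k,
# which is where the memory cost comes from for larger k.
print("possible 8-mers:", 4 ** 8)
```

The resulting table could be passed to any distance-based tree-building method. The profile for each sequence is computed only once, so the per-pair work is limited to comparing two count tables.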