Mark S. Springer, John Gatesy.
Abstract
Summary coalescence methods have emerged as a popular alternative for inferring species trees with large genomic datasets, because these methods explicitly account for incomplete lineage sorting. However, statistical consistency of summary coalescence methods is not guaranteed unless several model assumptions are true, including the critical assumption that recombination occurs freely among but not within coalescence genes (c-genes), which are the fundamental units of analysis for these methods. Each c-gene has a single branching history, and large sets of these independent gene histories should be the input for genome-scale coalescence estimates of phylogeny. By contrast, numerous studies have reported the results of coalescence analyses in which complete protein-coding sequences are treated as c-genes even though exons for these loci can span more than a megabase of DNA. Empirical estimates of recombination breakpoints suggest that c-genes may be much shorter, especially when large clades with many species are the focus of analysis. Although this idea has been challenged recently in the literature, the inverse relationship between c-gene size and increased taxon sampling in a dataset (the 'recombination ratchet') is a fundamental property of c-genes. For taxonomic groups characterized by genes with long intron sequences, complete protein-coding sequences are likely not valid c-genes and are inappropriate units of analysis for summary coalescence methods unless they occur in recombination deserts that are devoid of incomplete lineage sorting (ILS). Finally, it has been argued that coalescence methods are robust when the no-recombination-within-loci assumption is violated, but recombination must matter at some scale because ILS, a by-product of recombination, is the raison d'être for coalescence methods. That is, extensive recombination is required to yield the large number of independently segregating c-genes used to infer a species tree.
If coalescent methods are powerful enough to infer the correct species tree for difficult phylogenetic problems in the anomaly zone, where concatenation is expected to fail because of ILS, then there should be a decreasing probability of inferring the correct species tree using longer loci with many intralocus recombination breakpoints (i.e., increased levels of concatenation).
Keywords: coalescence genes; phylogenomics; protein-coding sequences; recombination breakpoints; recombination ratchet
Year: 2018 PMID: 29495400 PMCID: PMC5867844 DOI: 10.3390/genes9030123
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. Illustration of the recombination ratchet for a hypothetical 10-kb chromosomal segment of DNA and a phylogeny with nine taxa and three short internodes, each of which results in incomplete lineage sorting and deep coalescence for a local subtree with three taxa. The three short internodes (internal branches) are labeled with red asterisks. Other internal branches are longer, and deep coalescence does not occur on them. The three subtrees are for taxa A-B-C (a); D-E-F (b); and G-H-I (d). Incomplete lineage sorting for each set of three taxa (a,b,d) is associated with nine recombination breakpoints and ten c-genes of average length 1000 bp. For each set of three taxa there are three different genealogical histories with contrasting topological relationships (genealogical histories for the same topology but with different branch lengths are ignored). Numbers below c-genes correspond to Robinson–Foulds (RF) distances [24], sensu Sul and Williams [25], relative to the species tree. Note that for each chromosomal segment with only three taxa (panels a,b,d), the maximum RF distance is 1. The overlay of nine recombination breakpoints for A-B-C and nine recombination breakpoints for D-E-F results in a total of 18 recombination breakpoints and 19 c-genes for the six-taxon phylogeny (A-B-C-D-E-F) (panel c). Average c-gene size for the 10-kb chromosomal segment with six taxa is 526 bp. Nine different topologies are possible for these six taxa, of which seven are represented among the 19 c-genes. The maximum RF distance is 2 for c-genes based on six taxa (panel c). The overlay of 18 recombination breakpoints for A-B-C-D-E-F with nine recombination breakpoints for G-H-I results in 27 recombination breakpoints and 28 c-genes for the nine-taxon phylogeny (A-B-C-D-E-F-G-H-I) (panel e). For the nine-taxon phylogeny, mean c-gene size is reduced to 357 bp. Among the 28 c-genes for the nine-taxon phylogeny, 16 of 27 possible topologies are represented (panel e).
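The arithmetic behind the ratchet in Figure 1 can be reproduced with a minimal Python sketch. The breakpoint positions below are hypothetical (the caption gives only the counts, nine per three-taxon subtree), and the name `ratchet` is illustrative. The sketch assumes that no two subtrees share a breakpoint position and that each three-taxon subtree contributes three alternative topologies, as the caption states.

```python
# Hypothetical breakpoint positions (bp) on a 10-kb segment; the caption gives
# only the counts (nine per three-taxon subtree), not the positions.
SEGMENT_LENGTH = 10_000
BREAKPOINTS = {
    "A-B-C": set(range(1000, 10_000, 1000)),  # 1000, 2000, ..., 9000
    "D-E-F": set(range(500, 9_000, 1000)),    # 500, 1500, ..., 8500
    "G-H-I": set(range(250, 9_000, 1000)),    # 250, 1250, ..., 8250
}


def ratchet(subtrees):
    """Overlay the breakpoints of the given subtrees and summarize the c-genes."""
    merged = set().union(*(BREAKPOINTS[name] for name in subtrees))
    n_c_genes = len(merged) + 1  # b breakpoints cut the segment into b + 1 pieces
    return {
        "breakpoints": len(merged),
        "c_genes": n_c_genes,
        "mean_length_bp": SEGMENT_LENGTH / n_c_genes,
        # three alternative topologies per three-taxon subtree
        "possible_topologies": 3 ** len(subtrees),
    }


for taxa in (["A-B-C"], ["A-B-C", "D-E-F"], ["A-B-C", "D-E-F", "G-H-I"]):
    print(" + ".join(taxa), ratchet(taxa))
```

Under these assumptions, each added subtree raises the breakpoint count and shrinks the mean c-gene length, matching the 10, 19, and 28 c-genes in the caption. If two subtrees happened to share a breakpoint position, the set union would count it once and the totals would be correspondingly lower.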
Comparison of the number of recombination breakpoints and coalescence genes (c-genes) based on CoalHMM and the four-gamete test (FGT) for four target regions from Hobolth et al. [18].
| Target Number | Target Length (bp) | Mean CoalHMM C-Gene Length (bp) | FGT: Number of Recombination Breakpoints | FGT: Number of C-Genes | FGT: Mean C-Gene Length (bp) |
|---|---|---|---|---|---|
| 1 | 1255492 | 123.0 | 1143 | 1144 | 1097.5 |
| 106 | 257420 | 84.1 | 303 | 304 | 846.8 |
| 121 | 230666 | 102.9 | 287 | 288 | 800.9 |
| 122 | 92240 | 108.8 | 126 | 127 | 726.3 |
CoalHMM c-gene sizes are based on Hobolth et al.’s [18] analyses with additional calculations reported in Springer and Gatesy [7].
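For readers unfamiliar with the four-gamete test, the following Python sketch shows the basic idea: two biallelic sites that display all four two-site haplotypes cannot be explained by a single tree without recurrent mutation, so a recombination breakpoint is inferred between them. This is a simplified greedy left-to-right scan written for illustration; it is not the procedure or software used to produce the table, and the function names and toy alignment are hypothetical.

```python
def biallelic_sites(seqs):
    """Indices of alignment columns with exactly two states among A, C, G, T."""
    sites = []
    for i in range(len(seqs[0])):
        column = [s[i] for s in seqs]
        if all(c in "ACGT" for c in column) and len(set(column)) == 2:
            sites.append(i)
    return sites


def four_gametes(seqs, i, j):
    """True if sites i and j together show all four two-site haplotypes."""
    return len({(s[i], s[j]) for s in seqs}) == 4


def greedy_fgt_breakpoints(seqs):
    """Scan left to right; return intervals (left_site, right_site) that must
    each contain a breakpoint under the four-gamete criterion."""
    breakpoints = []
    block = []  # sites in the current block, pairwise compatible
    for site in biallelic_sites(seqs):
        conflicts = [p for p in block if four_gametes(seqs, p, site)]
        if conflicts:
            last = max(conflicts)
            breakpoints.append((last, site))
            # keep only the sites to the right of the conflict, then add this one
            block = [p for p in block if p > last] + [site]
        else:
            block.append(site)
    return breakpoints


# Toy alignment (hypothetical): sites 0 and 2 support one split of the four
# sequences, sites 3 and 5 support a conflicting split.
toy = ["ATCAGT", "ATCCGG", "GTTAGT", "GTTCGG"]
found = greedy_fgt_breakpoints(toy)
print(found, "->", len(found) + 1, "c-genes")
```

As in the table, the number of c-genes is the number of inferred breakpoints plus one. The greedy scan yields a conservative count: it detects only recombination events that leave a four-gamete signature among the sampled sequences.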
Figure 2. Results of the four-gamete test (FGT) for the ARMC3 gene (~120.1 kb alignment). The FGT was applied separately to two subtrees of primates (panel a): four hominids (Homo, Pan, Gorilla, Pongo) and four cercopithecids (three Macaca spp., Papio). Primate relationships are as in Springer et al. [71]. The FGT was applied to the entire ARMC3 gene from start codon to stop codon, but the results are only illustrated for the first 12 kb of the alignment. Recombination breakpoints and c-gene trees are shown for hominids (panel b) and Macaca spp. (panel c). In both cases there are three possible topologies, all of which are represented by one or more c-gene trees (the outgroups Pongo and Papio are not shown). Panel (d) shows the results of the recombination ratchet, where overlay of 11 recombination breakpoints for hominids and three recombination breakpoints for Macaca results in 14 recombination breakpoints and 15 c-genes for the 12 kb alignment. Among the nine topologies that are possible for the two subtrees of three taxa, seven are represented among the 15 c-gene trees (panel d). Paintings of Pongo and Papio by Carl Buell.
Comparison of the number of recombination breakpoints and mean c-gene size based on seven different recombination detection methods implemented in RDP4 [62] that were applied to four target regions from Hobolth et al. [18].
| Method | Target 1 | Target 106 | Target 121 | Target 122 |
|---|---|---|---|---|
| RDP | 45/27.3 | 21/11.7 | 20/11.0 | 4/18.4 |
| GENECONV | 29/41.8 | 11/21.5 | 9/23.1 | 5/15.4 |
| MaxChi | 17/69.7 | 4/51.5 | 7/28.8 | 4/18.4 |
| Chimaera | 14/83.7 | 1/128.7 | 8/25.6 | 2/30.7 |
| BootScan | 36/33.9 | 14/17.2 | 16/13.6 | 9/9.2 |
| 3Seq | 26/46.5 | 9/25.7 | 8/25.6 | 2/30.7 |
| SiScan | 4/251.1 | 1/128.7 | 0/230.7 | 0/92.2 |
For each method and target region, the number of recombination breakpoints is shown before the slash and mean c-gene size (in kb) is shown after the slash.
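The two numbers in each cell appear to be linked by simple arithmetic. Assuming that $b$ breakpoints divide a target region of length $L$ into $b + 1$ c-genes (the symbols $L$, $b$, and $\bar{\ell}$ are introduced here for illustration), the mean c-gene length is

$$
\bar{\ell} = \frac{L}{b + 1}.
$$

Using the target lengths from the CoalHMM/FGT comparison table, Target 1 with RDP gives $1255.492\ \text{kb} / (45 + 1) \approx 27.3\ \text{kb}$, and Target 122 with BootScan gives $92.240\ \text{kb} / (9 + 1) \approx 9.2\ \text{kb}$, both matching the tabulated values. When a method detects no breakpoints ($b = 0$), the whole target is treated as one c-gene, as with SiScan for Target 121 (230.7 kb).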