Abstract
Phylogenomic subsampling is a procedure by which small sets of loci are selected from large genome-scale data sets and used for phylogenetic inference. This step is often motivated by either computational limitations associated with the use of complex inference methods or as a means of testing the robustness of phylogenetic results by discarding loci that are deemed potentially misleading. Although many alternative methods of phylogenomic subsampling have been proposed, little effort has gone into comparing their behavior across different data sets. Here, I calculate multiple gene properties for a range of phylogenomic data sets spanning animal, fungal, and plant clades, uncovering a remarkable predictability in their patterns of covariance. I also show how these patterns provide a means for ordering loci by both their rate of evolution and their relative phylogenetic usefulness. This method of retrieving phylogenetically useful loci is found to be among the top performing when compared with alternative subsampling protocols. Relatively common approaches such as minimizing potential sources of systematic bias or increasing the clock-likeness of the data are found to fare worse than selecting loci at random. Likewise, the general utility of rate-based subsampling is found to be limited: loci evolving at both low and high rates are among the least effective, and even those evolving at optimal rates can still widely differ in usefulness. This study shows that many common subsampling approaches introduce unintended effects in off-target gene properties and proposes an alternative multivariate method that simultaneously optimizes phylogenetic signal while controlling for known sources of bias.
Keywords: molecular evolution; phylogenetic inference; phylogenetic signal; phylogenomics; systematic biases
Year: 2021 PMID: 33983409 PMCID: PMC8382905 DOI: 10.1093/molbev/msab151
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Phylogenomic Data Sets Employed.
| Data Set | Age (Ma) | Number of Taxa | Number of Loci | Occupancy (%) | Mean Locus Length |
|---|---|---|---|---|---|
| Actinopterygii ( | 376.3 | 302 | 1,035 | 81.2 | 167.1 |
| Araneae ( | 366.1 | 160 | 1,114 | 64.2 | 218.8 |
| Aspergillaceae ( | 117.4 | 81 | 1,660 | 97.5 | 633.8 |
| Blattodea ( | 206.7 | 45 | 2,556 | 82.1 | 374.4 |
| Echinoidea ( | 265.0 | 34 | 2,356 | 71.6 | 257.1 |
| Gnathostomata ( | 457.6 | 100 | 4,543 | 81.6 | 430.4 |
| Heliozelidae ( | 84.0 | 38 | 1,040 | 92.2 | 271.4 |
| Hemipteroids ( | 420.3 | 171 | 2,225 | 90.6 | 771.0 |
| Hexapoda ( | 479.1 | 134 | 1,467 | 94.7 | 869.5 |
| Hymenoptera ( | 281.0 | 169 | 2,665 | 84.8 | 647.6 |
| Lepidoptera ( | 299.5 | 186 | 2,021 | 88.8 | 359.4 |
| Monilophytes (Shen, Jin, et al. 2018) | 321.1 | 69 | 2,357 | 89.5 | 284.3 |
| Myriapoda ( | 504.4 | 40 | 1,942 | 82.2 | 297.1 |
| Opiliones ( | 414.2 | 54 | 1,288 | 63.2 | 265.7 |
| Phasmatodea ( | 121.8 | 38 | 1,022 | 88.6 | 772.3 |
| Pseudoscorpiones ( | 337.5 | 41 | 2,110 | 63.2 | 376.1 |
| Saccharomycotina (Shen, Opulente, et al. 2018) | 404.0 | 332 | 2,348 | 88.1 | 464.6 |
| Scorpiones ( | 381.3 | 30 | 1,462 | 86.6 | 226.3 |
Note.—Age constitutes the inferred date of the last common ancestor of the ingroup (in million years, My) as estimated by the same study. Number of taxa corresponds only to ingroup taxa, number of loci to those for which all properties could be estimated (see Materials and Methods); these and other numbers can differ from those reported in the original studies.
Fig. 1. Gene properties covary in predictable ways, revealing underlying patterns of evolution that are shared by all phylogenomic data sets. The dendrogram shows that the eigenvectors of PC axes can be clustered into two major groups, labeled as patterns A and B. While pattern A is generally captured by PC 1 (green icons) and pattern B by PC 2 (orange icons), the hexapod and phasmatodean data sets are inverted. The histograms at the bottom show the distribution of loadings across variables. Results using k-means clustering are shown in supplementary figure S3, Supplementary Material online.
Fig. 2. Rate of evolution is the primary factor driving differences in gene properties. Scores of loci along PCs 1 (A) and 2 (B) were correlated against the log-transformed harmonic means of site rates. Blue lines correspond to LOESS regressions, and Spearman’s rank correlation coefficients (ρ) are shown in each plot. Clade icons are as in figure 1; the deviating hexapod and phasmatodean data sets are highlighted in red. Results using a tree-based estimate of evolutionary rates are shown in supplementary figure S4, Supplementary Material online.
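The correlation described in this caption can be illustrated with a minimal sketch. It assumes each locus has a list of per-site rate estimates and a score along one PC axis; the locus names, rate values, and scores below are hypothetical toy data, and the helper functions are illustrative rather than the code used in the study.

```python
import math
from statistics import harmonic_mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the ranked values."""
    return pearson(average_ranks(x), average_ranks(y))


def log_harmonic_mean_rate(site_rates):
    """Log-transformed harmonic mean of a locus's (positive) site rates."""
    return math.log(harmonic_mean(site_rates))


# Hypothetical toy data: per-site rates and PC 1 scores for four loci.
site_rates_per_locus = {
    "locus_a": [0.2, 0.4, 0.1],
    "locus_b": [1.0, 2.0, 1.5],
    "locus_c": [3.0, 5.0, 4.0],
    "locus_d": [0.5, 0.8, 0.6],
}
pc1_scores = {"locus_a": -1.9, "locus_b": 0.4, "locus_c": 2.1, "locus_d": -0.6}

loci = sorted(site_rates_per_locus)
log_rates = [log_harmonic_mean_rate(site_rates_per_locus[name]) for name in loci]
scores = [pc1_scores[name] for name in loci]
rho = spearman_rho(scores, log_rates)
print(f"Spearman's rho = {rho:.3f}")
```

Because Spearman's ρ depends only on ranks, the log transformation does not change its value; it matters for the scatterplots and LOESS fits rather than for the coefficient itself.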
Fig. 3. Comparison of the performance of alternative subsampling strategies. (A) Distribution of ranks attained by different strategies (lower ranks represent better results). Two criteria for selecting adequate strategies are highlighted: those whose median ranks are lower than randomly chosen loci (grey background), and those that outperform these in more than half of the data sets (yellow bars). The proportion of times a given strategy ranks better than random loci is shown at the bottom. Results correspond to matrices of 250 loci; those for 50 loci are shown in supplementary figure S7, Supplementary Material online. (B) NMDS of pair-wise distances between strategies, representing the average frequency with which they share loci (smaller distances represent higher probabilities of targeting the same loci). Average RF similarity (orange lines) is overlaid as a smooth surface. PC 2 defines an axis that traverses the RF similarity gradient, whereas PC 1 (and other rate proxies) sample genes along a perpendicular axis that follows an isocline.
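The two criteria highlighted in panel A can be sketched as follows. This is a minimal illustration assuming a table of ranks per data set and strategy (lower is better); the data set names, strategy names, and rank values are hypothetical and do not reproduce the study's results.

```python
from statistics import median

# Hypothetical ranks: ranks_by_dataset[data set][strategy] -> rank (lower is better).
ranks_by_dataset = {
    "dataset_1": {"random": 3, "pc2": 1, "clocklike": 4, "low_rate": 2},
    "dataset_2": {"random": 2, "pc2": 1, "clocklike": 4, "low_rate": 3},
    "dataset_3": {"random": 3, "pc2": 2, "clocklike": 1, "low_rate": 4},
    "dataset_4": {"random": 2, "pc2": 1, "clocklike": 3, "low_rate": 4},
}


def summarize(ranks, baseline="random"):
    """Compare each strategy against the baseline of randomly chosen loci."""
    datasets = sorted(ranks)
    baseline_ranks = [ranks[d][baseline] for d in datasets]
    summary = {}
    for strategy in sorted(ranks[datasets[0]]):
        if strategy == baseline:
            continue
        strategy_ranks = [ranks[d][strategy] for d in datasets]
        # Number of data sets in which the strategy ranks better than random.
        wins = sum(s < b for s, b in zip(strategy_ranks, baseline_ranks))
        win_fraction = wins / len(datasets)
        summary[strategy] = {
            "median_rank": median(strategy_ranks),
            # Criterion 1: median rank lower than that of random loci.
            "lower_median_than_random": median(strategy_ranks) < median(baseline_ranks),
            # Criterion 2: better than random in more than half of the data sets.
            "wins_majority": win_fraction > 0.5,
            "win_fraction": win_fraction,
        }
    return summary


summary = summarize(ranks_by_dataset)
for name, stats in summary.items():
    print(name, stats)
```

In this toy table only the hypothetical `pc2` strategy satisfies both criteria, which mirrors the way the figure separates strategies that beat random loci from those that do not.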
Fig. 4. Detection of outlier genes using multiple gene properties in two exemplary data sets, Lepidoptera (left) and Pseudoscorpiones (right). Plots show the PC axes built from the entire data sets, with the genes considered outliers shown in red. The topology of the largest outlier (highlighted with a black border) is plotted.