Literature DB >> 34390574

An Efficient, Nonphylogenetic Method for Detecting Genes Sharing Evolutionary Signals in Phylogenomic Data Sets.

Luiz Thibério Rangel¹, Shannon M Soucy², João C Setubal³, Johann Peter Gogarten^4,5, Gregory P Fournier¹.

Abstract

Assessing the compatibility between gene family phylogenies is a crucial and often computationally demanding step in many phylogenomic analyses. Here, we describe the Evolutionary Similarity Index (IES), a means to assess shared evolution between gene families using a weighted orthogonal distance regression model applied to sequence distances. The utilization of pairwise distance matrices circumvents comparisons between gene tree topologies, which are inherently uncertain and sensitive to evolutionary model choice, phylogenetic reconstruction artifacts, and other sources of error. Furthermore, IES enables the many-to-many pairing of multiple copies between similarly evolving gene families. This is done by selecting non-overlapping pairs of copies, one from each assessed family, and yielding the least sum of squared residuals. Analyses of simulated gene family data sets show that IES's accuracy is on par with popular tree-based methods while also less susceptible to noise introduced by sequence alignment and evolutionary model fitting. Applying IES to an empirical data set of 1,322 genes from 42 archaeal genomes identified eight major clusters of gene families with compatible evolutionary trends. The most cohesive cluster consisted of 62 genes with compatible evolutionary signal, which occur as both single-copy and multiple homologs per genome; phylogenetic analysis of concatenated alignments from this cluster produced a tree closely matching previously published species trees for Archaea. Four other clusters are mainly composed of accessory genes with limited distribution among Archaea and enriched toward specific metabolic functions. Pairwise evolutionary distances obtained from these accessory gene clusters suggest patterns of interphyla horizontal gene transfer. An IES implementation is available at https://github.com/lthiberiol/evolSimIndex.

Entities: Chemical

Keywords: Archaea; bioinformatics; clustering; gene coevolution; phylogenomics

Mesh：

Year: 2021 PMID： 34390574 PMCID： PMC8483891 DOI： 10.1093/gbe/evab187

Source DB: PubMed Journal: Genome Biol Evol ISSN： 1759-6653 Impact factor: 3.416

Significance

Detecting shared evolutionary trends of gene families is necessary for distinguishing genes with incompatible evolutionary signals from those suitable to reconstruct reliable phylogenomic trees. Commonly used methods to achieve this directly compare the topologies of gene trees, which tend to be inaccurate given the inherent uncertainty of phylogenetic reconstruction. We propose the Evolutionary Similarity Index (), based on orthogonal distance regressions between evolutionary distance matrices. Simulations show that substantially outperforms tree-based methods, at a fraction of the computational effort, and is able to evaluate similarities between gene families containing duplication events. We used to assess the compatibility between archaeal gene families, producing a cohesive cluster of 62 genes with similar evolutionary signals, likely representing a central evolutionary trend of archaeal genomes.

Introduction

Phylogenies reconstructed from single genes are known to poorly reflect the underlying history of whole genomes; consequently the detectable phylogenetic signal from an isolated locus cannot be extrapolated to reflect the evolution of whole genomes (Dagan and Martin 2006; Bapteste et al. 2009; Koonin et al. 2009). To ameliorate this effect, it has become common practice to estimate species’ evolutionary histories by concatenating multiple sequence alignments of single-copy genes widely conserved across sampled genomes. The preference toward using core genome sequences is due to their expected resistance to horizontal gene transfer (HGT) (Thomas and Nielsen 2005; Sorek et al. 2007; Popa and Dagan 2011); however, despite the lower frequency of HGT among some gene families, it has been shown that horizontal exchange also occasionally affects core genes. In fact, the slow substitution rate and corresponding high sequence conservation of the core genome may even favor HGT, permitting increases in neutral and nearly neutral HGT at the genus and species levels (Papke and Gogarten 2012; Shapiro et al. 2012). Given this context, it is clear that more rigorous methods are needed to identify genes best reflecting the underlying vertical evolutionary signal in a group of species; such methods should seek to maximize the compatibility between evolutionary signals in order to provide a more robust basis for phylogenomic reconstruction. Many strategies have been proposed to assess similarities between phylogenetic signals obtained by individual gene trees—for example, Robinson–Foulds bipartition compatibility (RF) (Robinson and Foulds 1981), geodesic distance ( (Kimmel and Sethian 1998; Billera et al. 2001; Kupczok et al. 2008; Owen and Provan 2011), matching split distance () (Lin et al. 2012; Bogdanowicz and Giaro 2012), and quartet distance () (Estabrook et al. 1985)—as well as other methods that assess similarities between phylogenetic profiles (Pellegrini et al. 1999; Vert 2002; Barker and Pagel 2005; Liu et al. 2018). The majority of tree-based methods rely on straightforward comparisons between tree topologies (Robinson and Foulds 1981; Kunin et al. 2005; Leigh et al. 2008; Puigbò et al. 2009; Lin et al. 2012; Bogdanowicz and Giaro 2012; Mirarab et al. 2014; Gori et al. 2016). However, although an intuitive solution, comparisons between tree topologies require phylogenetic trees of all assessed gene families to be accurately reconstructed, adding a substantial computational cost to an already computationally demanding task. Furthermore, the vastness of tree space, combined with the inherent uncertainty of phylogenetic reconstruction, provides multiple sources of error to tree-based evolutionary similarity assessments. Another method to assess the evolutionary compatibility of genes is based on similarities between patterns of presence and absence (phylogenetic profiles) of such genes among genomes of interest. Although more recent implementations displayed substantial improvements (Liu et al. 2018), reliance on an initial reference tree constitutes an obstacle to its application to new taxon samplings. Phylogenetic profile-based methods also do not assess divergences between sequences of homologous genes, which limits the resolution of their results. Accounting for uncertainty-based variations in tree topology (i.e., bipartition support) further increases the computational burden and decreases the resolution of the evaluated phylogenetic signal (e.g., collapsing low support bipartitions or weighing them based on support). A proposed solution to bypass the computational cost of tree similarity assessments is Pearson’s correlation coefficient (r) between evolutionary distance matrices (Goh et al. 2000; Pazos and Valencia 2001; Novichkov et al. 2004; Rangel et al. 2019). Unlike tree-based comparisons, methods based on Pearson’s r enable simple implementations to detect similar evolutionary signals between gene families with histories complicated by multiple homologs within genomes. This is accomplished by estimating multiple correlation coefficients, each using distinct combinations of paralogs between gene families (Gertz et al. 2003; Ramani and Marcotte 2003). Despite its application in protein–protein interaction studies, the sensitivity of Pearson’s r to noise in evolutionary distances and the granularity of its estimates have yet to be compared with those of tree-based metrics. Direct coupling analysis has also been used to pair gene copies between possibly coevolving gene families (Gueudré et al. 2016), but despite positive results the assumption that protein products of coevolving genes must be directly interacting may limit its applications. Given the limitations of the aforementioned approaches and methods, we propose the to quantify the similarity between evolutionary histories based on weighted orthogonal distance regression (ODR) (wODR) between pairwise distance matrices. We show that is robust to dissimilarity saturation resulted by up to 50 simulated perturbations to underlying phylogenies. More common tree-based evolutionary similarity estimates must be corrected for distance saturation to display similar robustness and are significantly more susceptible to errors in evolutionary history reconstruction. As a case study for this new method, we assessed evolutionary similarities across 1,322 archaeal gene families and detected significant evolutionary incompatibilities between conserved single-copy genes, as well as a clear central evolutionary tendency involving 62 gene families that occur as both single and multiple copies across genomes.

Results and Discussion

Simulated Data Set

Similarities between evolutionary histories of simulated gene families were assessed using our newly proposed and four tree-based metrics: , RF, , and . Results from all five approaches successfully identified the monotonic decrease in similarity between simulated gene families as the number of perturbations increased (supplementary figs. S2 and S3, Supplementary Material online); tree-based estimates, however, were heavily impacted by noise introduced through empirical sequence alignment methods and suboptimal evolutionary model (fig. 1 and supplementary figs. S2–S4, Supplementary Material online). The accuracy of each method in identifying the degree of similarity between simulated gene families was evaluated by calculating ordinary least squares (OLS) between pairwise shared evolution estimates ( or for tree-based metrics) and number of simulated perturbations. OLS models were fitted using a Y-axis intercept of 0 as two gene families with identical histories should yield no differences according to any method (supplementary figs. S2 and S3, Supplementary Material online).

Boxplots of OLS for shared evolution estimates from perfectly aligned simulated gene families for each data set of simulated phylogenetic perturbations. Each data set was replicated ten times, and scatterplot and fitted OLS regressions are available in supplementary figure S2, Supplementary Material online. Negative values occur as fitted linear regressions do not explain the association between variables, and in this scenario reflect strong saturation of evolutionary similarity measurements. Under perfect conditions (i.e., no sequence alignment errors and optimal evolutionary model), performed on par with tree-based adjusted estimates (fig. 1). Among gene families simulated with weak phylogenetic perturbations, significantly outperformed tree-based methods as measured by OLS . Adjusted performed similarly to among gene families with medium perturbations. In simulations generated using strong phylogenetic perturbations, adjusted significantly outperformed as a predictor of differences between simulated gene families; for simulations with strong phylogenetic perturbations no significant difference was detected between , RF, and . Among simulated genes reconstructed with perfect information, displayed greater accuracy when compared with established tree-based methods even though three out of four tree-based methods ignore branch lengths (i.e., , , and ), and consequently are not affected by perturbations applied to branch lengths. Among tree-based methods, was shown to be the most negatively impacted by strong phylogenetic perturbations between simulated gene trees. Although distributions obtained from SPR10 and SPR50 data sets are not significantly different from other tree-based methods, showed an abrupt decrease in performance among SPR100 gene families. This may be due to a high sensitivity to long SPR moves or suggest that the applied saturation correction is not well suited for its full range of distance estimates. Phylogenetic reconstruction using a simulated sequence alignment (tree_1_TRUE.phy, from SPR10 and replicate_1) under LG+G model by IQTree 1.6.7 took 393.6 s in a single thread; computing the pairwise distance matrix for the same alignment took 6.487 s. Both computations were performed on a 3 GHz Intel Xeon W processor. Although may provide marginal gains in extreme scenarios in comparison to , it comes with an added computing time of almost 150×, without bipartition support assessment. When assessing large data sets, the quadratic nature of distance matrices will decrease the reported disparity between computing times of and tree-base methods, even though the latter still rely in bipartition support values for enhanced accuracy. Both the time and computing resources required for reasonable phylogenetic reconstruction constitute prohibitive factors toward the assessment of shared evolution in large multi-genome data sets.

Robustness Assessment between Approaches

The dichotomic pattern in a cladogram is extremely susceptible to uncertainties in phylogenetic reconstruction. Combined with the vast tree space available for 50 taxa, this can cause noise-induced topological variations to be not directly distinguishable from real deviations in evolutionary history (Szöllosi et al. 2013). Pairwise maximum likelihood distance matrices used to estimate are less prone to such uncertainty as they bypass forming hypotheses about the evolutionary relationships between taxa. This assumption is corroborated by comparing evolutionary similarity estimates obtained using error-free sequence alignments and optimal evolutionary models to those using realigned sequences and suboptimal evolutionary models. Although alignment errors (average SP-score of 0.87) and evolutionary model suboptimality (JTT) were kept to the minimum one could expect in empirical data sets, the resulting noise is sufficient to negatively impact RF, , and estimates. In all three simulated data sets (SPR10, SPR50, and SPR100) tree-based methods, except , displayed significantly smaller OLS when compared with the number of phylogenetic perturbations between simulated gene families (supplementary fig. S5, Supplementary Material online). On the other hand, estimates were shown to be robust toward error-induced noise in both SPR50 and SPR100 data sets (supplementary fig. S5, Supplementary Material online).

Evolutionary Similarities within Archaeal Gene Families

In order to test performance when estimating shared phylogenetic signal in an empirical set of gene families, we evaluated 1,322 families of homologous proteins assembled from annotated coding sequences (CDS) extracted from 42 archaeal genomes (supplementary table S1, Supplementary Material online). This empirical data set contains conserved and accessory gene families with different sizes due to gene losses, duplications, and transfers. was estimated for all pairwise combinations of gene families present in at least ten genomes, with 2,142 out of 748,712 comparisons having values of at least 0.7 (supplementary fig. S6, Supplementary Material online). Pairs of gene families with an were added as nodes to a network with pairwise as edge weights connecting gene families. In total 419 unique archaeal gene families were added to the network, whereas the remaining 903 gene families did not display any with other gene families. The 0.7 threshold was selected as it is the most robust to threshold increases, as measured by Variation of Information (Meilă 2007), while also yielding the greater cluster modularity than networks obtained using lower thresholds (supplementary table S2, Supplementary Material online). Although increasing the threshold from 0.7 to 0.75 leads to a 26% increase in cluster modularity, it does so by removing 71.15% of edges (supplementary table S2, Supplementary Material online). Previous applications based on Pearson’s r between evolutionary distances have also suggested a 0.7 threshold (Goh et al. 2000; Pazos and Valencia 2001). The resulting evolutionary similarity network (fig. 2) is heavily imbalanced, with just 11% of nodes involved in 50% of network edges. The majority of gene families (68%) did not display any above the 0.7 threshold with other gene families, suggesting a general incompatibility between evolutionary signals, or an inability to detect compatibility with this method. However, the high edge concentration within just a few nodes suggests a strong central signal present among few gene families, from which the evolutionary trajectories of others have diverged (Puigbò et al. 2009).

A compatible evolutionary signal network, with each node representing a gene family, and edges connecting nodes representing shared evolutionary signals (). Nodes of the same colors have similar evolutionary trends, as identified by Louvain community detection. Triangular nodes represent single-copy genes, and circular nodes are gene families containing duplications. Clusters of CES gene families with less than ten members are not represented. A hierarchical clustering of pairwise based on average nucleotide distances between gene families within individual genomes suggests that shared evolution estimates are strongly impacted by genetic linkage (supplementary fig. S6, Supplementary Material online). Pairs of gene families within close genomic proximity (i.e., apart from each other by fewer than 10,000 bp in at least 21 genomes) have significantly more similar evolutionary trends than pairs separated by long genomic distances (i.e., apart from each other by more than 100,000 bp in at least 21 genomes), as depicted in figure 3 ( and ). Pairs of gene families with strong genetic linkage displayed significantly greater relative to pairs of gene families with weak linkage.

(A) Distributions of between pairs of genes within 10,000 bp of each other (blue) and between pairs of genes apart by at least 100,000 bp (orange). Neighboring gene pairs displayed significantly more similar evolutionary signals than nonneighboring gene pairs. (B) Ratio between the proportion of gene pairs in intra- and inter-CES clusters, Y axis, occurring within genomic windows, X axis. About 100 window sizes were assessed ranging from 1,000 to 1,000,000 bp.

Clusters of Gene Families with Compatible Evolutionary Signals

Evolutionary trends shared across gene families were grouped using Louvain community detection, recovering 41 compatible evolutionary signal (CES) clusters, of which 25 comprise only two gene families and eight major clusters contain ten or more similarly evolving gene families. As evidence of the effectiveness of clustering gene families using pairwise , we observe that CES clusters are strongly associated with genetic linkage within short genetic distances (fig. 3). Across short nucleotide distances between loci, linkage is a strong predictor of CES relations between genes, but its predictive power rapidly decreases as the genomic distance between two given loci increases (fig. 3), displaying a linear log–log relationship (supplementary fig. S7, Supplementary Material online). Comparing frequencies of intra- and intercluster gene pairs across distinct windows of genomic distances showed that the proportion of CES genes within 1,000 bp of each other is three times the proportion of non-CES genes within the same window. Increasing the window size led to abrupt decreases in proportion; within a 10,000-bp window, the ratio of CES genes is reduced to 1.8 the ratio of non-CES, and at a 100,000-bp window this difference in proportion falls to 1.2 (fig. 3). Given the strong genetic linkage and functional associations of operons, these results suggest that the evolutionary signal shared by gene pairs in known operons might be even stronger than that shown by figure 3. Among the eight CES clusters with ten or more gene families, four are comprised mostly core genes, and four are composed of mostly accessory genes (fig. 4). The four CES clusters of core genes (cluster#2, cluster#3, cluster#4, and cluster#5) are promising candidates for reconstructing a representative phylogenetic signal present within sampled Archaea. These four core CES clusters are composed of 102 extended core genes (single copy and present in at least 35 genomes) and 146 broadly distributed gene families present on average in 33 genomes, both as single and multiple copies. CES clusters of accessory genes (cluster#0, cluster#1, cluster#8, and cluster#15 in fig. 5) include specific archaeal clades, but do not map to well-established phylogenetic relationships; rather, they show polyphyletic gene distributions, likely caused by HGTs and/or gene losses shared by CES gene families. For example, cluster#0 is well represented amongst Euryarchaeota and hyperthermophilic TACK; cluster#15 comprises gene families with shared evolutionary trends mainly occurring within Crenarachaeota and hyperthermophilic Euryarchaeota; CES accessory gene families in cluster#1 and cluster#8 display congruent signals grouping methanogenic Euryarchaeota with Thaumarchaeota and Asgardarchaeota, respectively. Besides the eight CES clusters with ten or more gene families, the CES network community detection yielded 34 other clusters containing between two and nine gene families (see Supplementary Material online). These 34 CES clusters contain a total of 88 gene families, with degree centralities much smaller than the 331 gene families within the eight major CES clusters (averages of 1.36 and 17.79, respectively).

Fig. 4

Phylogenetic tree of Archaea reconstructed from 62 genes within CES cluster#4. The phylogeny was obtained using LG+F+G+C60 evolutionary model from IQTree and each gene had its parameters independently estimated according to parameter “-sp.” Bipartition supports were estimated using UFBoot and aLRT, each with 1,000 replicates, and bipartitions well supported by both methods are colored in green (UFBoot and aLRT ), whereas bipartitions well supported by a single method are colored in yellow. Red dotted lines indicate Nanohaloarchaea (1) and Nanoarchaea (2 and 3) placements reconstructed within phylogenies containing a single DPANN genome at a time. Despite the lack of outgroups to Archaea within our sample the tree is rooted in DPANN for the sake of visualization. The associated heatmap reflects the representation of gene families within CES clusters amongst archaeal genomes.

Heatmap of enriched KEGG Pathways within each CES cluster of gene families. Shades of red represent the proportion of genomes with detected KEGG Pathway enrichment within its homologs of CES gene families. Columns and rows were clustered using complete linkage and correlation coefficients. KEGG Pathways enriched in fewer than 10% of genomes in which CES genes occur are not reported. Cluster#15 did not show significant enrichment of KEGG Pathways. Phylogenetic tree of Archaea reconstructed from 62 genes within CES cluster#4. The phylogeny was obtained using LG+F+G+C60 evolutionary model from IQTree and each gene had its parameters independently estimated according to parameter “-sp.” Bipartition supports were estimated using UFBoot and aLRT, each with 1,000 replicates, and bipartitions well supported by both methods are colored in green (UFBoot and aLRT ), whereas bipartitions well supported by a single method are colored in yellow. Red dotted lines indicate Nanohaloarchaea (1) and Nanoarchaea (2 and 3) placements reconstructed within phylogenies containing a single DPANN genome at a time. Despite the lack of outgroups to Archaea within our sample the tree is rooted in DPANN for the sake of visualization. The associated heatmap reflects the representation of gene families within CES clusters amongst archaeal genomes. CDSs from 21 out of 42 sampled genomes have functional annotations available in StringDB (see Supplementary Material online), which was used to identify annotated KEGG Pathways enriched within homologs of CES gene families from each genome. In the dendrogram and heatmap depicted in figure 4, we clearly identify two sets of opposing CES clusters of gene families: accessories (top three rows) and core (bottom four rows). All four CES clusters of core gene families are enriched with KEGG Pathways related to genetic information processing (e.g., Ribosome, DNA replication, and Aminoacyl-tRNA biosynthesis), whereas CES clusters of accessory gene families are enriched with KEGG Pathways related to metabolism (e.g., Methane metabolism, Microbial metabolism in diverse environments, and Biosynthesis of antibiotics in fig. 4). KEGG Pathways related to metabolism display minor enrichment signal within CES clusters of core gene families, and KEGG Pathways related to genetic information processing are not enriched within clusters of accessory genes (fig. 4). Cluster#1, restricted to methanogenic Euryarchaeota and Thaumarchaeota, is enriched for methane metabolism genes within six genomes. Similarly, gene families from cluster#8, restricted to methanogenic Euryarchaeota and Asgardarchaeota, are also enriched in methane metabolism in five genomes (fig. 4).

Compatible Evolutionary Signal Clusters and Possible Vertical Evolutionary Signals

Phylogenies generated from extended core genomes are generally used as reasonable proxies of the species-tree phylogeny, given the assumption that these genes are less likely to undergo HGT between distantly related groups. However, an extended core phylogeny may not represent the species tree for several reasons, including systematic biases in phylogenetic reconstruction due to shared compositional bias, or strong biases in HGT partners among sets of genes. The extended core tree can still be used as an adequate representation of the consensus evolutionary signal detected in the sampled archaeal genomes, the closest thing we have to the simple “null hypothesis” of a shared history due to vertical inheritance. The 102 genes composing the extended core genome are not equally distributed across CES clusters (fig. 2); cluster#4 contains the greatest number of extended core genes, 44 out of 62 gene families, followed by cluster#3 with 27 out of 111 gene families. The split of the extended core genome into four distinct major CES clusters (fig. 2) suggests differing sets of HGTs among core genes, creating conflicting evolutionary histories between genes from different clusters (fig. 6). Closeness centrality measures () and node strength corrected by cluster size () suggest that cluster#4 gene families share stronger and more cohesive evolutionary trends than gene families from other clusters (supplementary fig. S8, Supplementary Material online). Therefore, cluster#4 contains the set of genes that may be best able to recover a representative evolutionary history across sampled Archaea. Cluster#4’s evolutionary history is also the most similar to that inferred from the extended core genome (figs. 5 and 6). The phylogeny obtained from concatenated cluster#4 genes is highly similar to recently published reconstructions of the archaeal Tree of Life, with virtually identical Euryarchaeota clade structure (Williams et al. 2017, 2020).

Scatterplots of pairwise evolutionary distances between gene families. Pairwise distances between CES clusters are shown in blue, whereas pairwise distances between CES clusters and the 102 core gene data set are shown in red. Similarities between evolutionary histories were estimated by . Although binning genes with congruent evolutionary histories permits phylogenetic reconstructions less likely to be subject to spurious signals arising from the averaging of conflicting evolutionary signals, the resulting phylogenies remain susceptible to phylogenetic reconstruction artifacts. For example, since is not estimated from phylogenies but from pairwise evolutionary distances, we do not expect it to be subject to long-branch attraction (LBA) artifacts. Nevertheless, phylogenies reconstructed from sets of genes with high between each other are as susceptible to LBA as any other data set. Despite robustness to phylogenetic artifacts, estimates are still affected by sampling biases: the overrepresentation of specific taxonomic groups can lead to underestimating deviations in the evolutionary history of less represented groups. LBA is a frequently invoked in discussions of archaeal phylogeny, specifically with regard to the phyletic status of the DPANN group (Brochier-Armanet et al. 2011; Petitjean et al. 2014; Raymann et al. 2014; Williams et al. 2015; Feng et al. 2021). Regardless of the set of genes used for phylogenetic reconstruction, extended core genome or any of the four CES clusters of core genes, all resulting trees depicted a well-supported DPANN clade composed of Nanohaloarchaea archaeon, Ca. Woesearchaeota archaeon, Nanoarchaeum equitans, and Ca. Diapherotrites archaeon. To assess the impact of LBA in reconstructing DPANN, we generated phylogenies from the CES clusters of core genes including only a single DPANN taxon at a time. When included individually, each DPANN taxon showed varying placements across trees generated from different CES clusters. This suggests that the initial monphyletic grouping of DPANN for these clusters was, in fact, artifactual and that the extended core gene history for these genomes is likely complex (see Supplementary Material online). Cluster#4 phylogenies individually testing the position of each DPANN taxa placed Nanohaloarchaea sister to Halobacteria, with Nanohaloarchaea+Halobacteria being sister to Methanomicrobia. Cluster#3 also reported Nanohaloarchaea sister to Halobacteria, but with both nested within Methanomicrobia. This placement for Nanohaloarchaea has been previously proposed by Narasingarao et al. (2012), Zhaxybayeva et al. (2013), Petitjean et al. (2014), and Feng et al. (2021). The uncertain placement of Nanoarchaea has also been topic of investigation (Huber et al. 2002; Brochier et al. 2005). Interestingly, each CES cluster recovered a different placement for Nanoarchaea: sister to Euryarchaeota (cluster#2), sister to Korarchaeota+Crenarchaeota (cluster#3), sister to Korarchaeota (cluster#4), and sister to Thermococcales (cluster#5). One of the most accepted Nanoarchaea placements is as sister to Thermococccales (Brochier et al. 2005; Urbonavičius et al. 2008; Dutilh et al. 2014), which in our analyses was recovered only by cluster#5 (see Supplementary Material online). Although our tests further support that the monophyly of DPANN is likely due to LBA, we did not detect a significant LBA effect for Woesearchaeota and Diapherotrites. Except for Diapherotrites placing between Class I and II methanogens in cluster#2, phylogenies from all four clusters proposed both taxa grouping together as sister to Euryarchaeota, assuming an archaeal root between TACK+Asgard and Euryarchaeota (see Supplementary Material online). The disparate placements of DPANN members within trees from CES clusters also suggests that, in addition to LBA, the DPANN clade from the extended core genome phylogeny is further impacted by the heterogeneity of the phylogenetic signal. This may not only produce a “signal averaging” effect favoring a monophyletic DPANN deeper in the archaeal tree, but may also be a contributing factor to the LBA artifact itself. Heterogeneity among combined phylogenetic signals is likely to increase the estimated branch length, as the incorrect assumption of a single underlying phylogeny will lead to more homoplastic sites. Assuming that a given set of genomes constitutes a monophyletic clade, it is also reasonable to expect a certain number of gene families to be over represented within the clade and not readily available to genomes outside the clade. Regardless of the driving force behind the enrichment of gene families within a clade, which may be inheritance from a common ancestor or biased HGT (Andam et al. 2010; Andam and Gogarten 2011), we identified 80 gene families enriched within TACK genomes and 111 within Euryarchaeota (). In contrast to TACK and Euryarchaeota clades, we did not detect any gene families enriched within the four sampled DPANN genomes, providing phylogenetically independent evidence against its monophyly. Complementarily, and in support of a Nanohaloarchaea+Halobacteria clade, we identified 78 genes present in Nanohaloarchaea archaeon enriched within the three Nanohaloarchaea and Halobacteria genomes. In another contrast, Thaumarchaeota, a well-accepted clade with similar number of sampled genomes in our data set, was found to have 456 gene families enriched within its five genomes. Although the differing degrees of physiological, metabolic, and genetic diversity within these groups certainly influence the number of shared gene families, it remains striking that this particular signal of shared ancestry is conspicuously lacking in DPANN.

Common and Distinct Evolutionary Trends between CES Clusters

Among CES clusters of core gene families, cluster#4 and cluster#5 are most evenly represented across archaeal groups, whereas cluster#2 and cluster#3 are poorly distributed among DPANN (fig. 5). All four CES clusters of core gene families have low frequency within Thaumarchaeota archaeon SCGC AB-539-E09, and only cluster#2 and cluster#4 are present in any abundance in Thermoplasmatales archaeon SCGC AB-539-N05, 29 and 44 gene families respectively. All four clusters display very similar overall phylogenies calculated from concatenations of genes within each cluster, varying mainly within the organization of Euryarchaeota (fig. 5 and Supplementary Material online). All four CES clusters of core genes reconstructed the monophyly of Euryarchaeota, with the exception of cluster#2, which placed Pyrococcus furiosus, Thermococcus kodakarensis, Methanocaldococcus jannaschii, Methanothermobacter thermautotrophicus, and Methanopyruus kandleri together as sister to Asgardarchaeota+TACK. Only cluster#4 recovered the monophyly of Methanomicrobia as sister to Halobacteria, as supported by Bayesian reconstructions reported in (Martijn et al. 2020; Williams et al. 2020). The other three CES clusters place Halobacteria within Methanomicrobia, as reported by (Williams et al. 2017) and in the RNA polymerase phylogeny reported in (Da Cunha et al. 2017). All four core CES clusters robustly identified Asgardarchaeota as sister to TACK (fig. 5), with small variation in the Asgardarchaeota phylogeny, and cluster#5 placed Korarchaeota at the base of the TACK superphylum (see Supplementary Material online) as previously reported in the literature (Williams et al. 2017, 2020); the remaining three CES clusters of core genes place Korarchaeota sister to Crenarchaeota (fig. 5 and Supplementary Material online). When assessing all-versus-all between CES clusters of core genes, the evolutionary signal detected from cluster#4 is the least dissimilar to the other three (fig. 6). This shortest path from cluster#4’s evolutionary trajectory to others suggests that cluster#4 best approximates the average archaeal evolutionary history (fig. 6). In general, the overall high estimates between core CES clusters suggest that despite composing distinct clusters, gene histories between clusters are generally congruent, with deviations reflecting small divergences potentially representing genes with specific sets of reticulate histories. Phylogenetic trees obtained from accessory gene families in cluster#0, cluster#1, cluster#8, and cluster#15 reconstructed all represented archaeal phyla as monophyletic (except for P. furiosus in Euryarchaeota in cluster#0, supplementary fig. S9, Supplementary Material online), suggesting a shared common origin of accessory genes from each CES cluster by all genomes from the same phylum. Although the monophyly of archaeal phyla within trees of CES clusters of accessory genes does not permit an accurate prediction of the directionality of possible interphyla HGTs, intraphylum distances congruent to the supposed vertical inheritance signal can be used to evaluate interphylum distances under a wODR model (supplementary figs. S10, S12, S14, and S15, Supplementary Material online). When compared with pairwise distances expected from vertical inheritance, interphylum distances that are significantly shorter than estimates obtained from intraphylum distances may be attributed to HGT acquisition by one of the phyla in question. For each CES cluster of accessory genes, we assessed wODR of its pairwise distances against the inferred vertical evolution signal estimated from cluster#4. When comparing pairwise distances obtained from cluster#1 against cluster#4, distances between Euryarchaeota and Thaumarchaeota are consistently placed below the estimated regression line (supplementary figs. S10 and S11, Supplementary Material online). This suggests that cluster#1 genes were horizontally transferred between ancestors of both phyla, causing shorter evolutionary distances between phyla than expected if their homologs diverged exclusively by vertical inheritance. Interphyla distances between Euryarchaeota and Crenarchaeota obtained from cluster#0 fit the evolutionary rate expected using intraphylum distances for this CES cluster (supplementary fig. S12, Supplementary Material online), suggesting that homologs from both phyla were vertically inherited from a common ancestor. Different behavior was seen for cluster#0 interphyla distances involving Thaumarchaeota (Crenarchaeota to Thaumarchaeota and Euryarchaeota to Thaumarchaeota), which are shorter than expected from the wODR using intraphylum distances (supplementary fig. S12, Supplementary Material online) and display significantly greater residuals than distances between Crenarchaeota and Euryarchaeota (supplementary fig. S13, Supplementary Material online). The absence of cluster#0 genes among Asgardarchaeota and Korarchaeota and the short interphyla distances to Thaumarchaeota homologs suggest an extensive loss among other phyla and horizontal acquisition by the thaumarchaeal ancestor from either crenarchaeal or euryarchaeal donors. Despite the occurrence of accessory genes from cluster#1 and cluster#8 in methanogenic Euryarchaeota (fig. 5) and the enrichment of methane metabolism pathways (fig. 4), evolutionary histories of both CES clusters are not related (fig. 2). Gene families in CES cluster#8 did not display outside its own cluster, constituting a separate connected component in the CES network depicted in figure 2. That said, cluster#8 gene families display shorter Euryarchaeota-Asgardarchaeota distances when compared with cluster#4 distances, but unlike cluster#0 and cluster#1, intra-Asgardarchaeota and intra-Euryarchaeota pairwise distances are not mutually compatible under a single linear regression (supplementary fig. S14, Supplementary Material online). The lack of a strong wODR anchor in the form of intraphyla distances suggests a more complex horizontal exchange history of cluster#8 genes, possibly involving intraphylum HGTs, which we cannot accurately assess with the data set used in this study. CES cluster#15 of accessory genes is well distributed among Crenarchaeota, and its intraphylum pairwise distances are congruent to cluster#4 distances, but their patchy occurrence among Euryarchaeota and Korarchaeota (fig. 5) does not permit a confident assessment of this cluster’s interphyla evolutionary history (supplementary fig. S9, Supplementary Material online).

Consistency of Duplicated Gene Copies within CES Clusters

Among the eight larger CES clusters, 89 gene families occur in multiple copies among genomes; in order for CES phylogenies be reconstructed a single copy must be selected as the best representative of the evolutionary signal shared by CES genes. These 89 gene families are found in multiples a total of 237 times across the 42 sampled archaeal genomes, 52 times within CES cluster#4. During each wODR between pairs of gene families only the copy yielding the least sum of squared residuals is selected as best representing the shared history by both families. In 71.7% of cases when multiple copies are present in a genome the same copy is supported by more than 70% of similarly evolving gene families (supplementary fig. S16, Supplementary Material online). The significant difference between the observed support of selected copies against a null distribution where each copy has the same probability of yielding the least sum of squared residuals further corroborates ’s effectiveness and robustness to stochastic noise (supplementary fig. S16, Supplementary Material online).

Conclusions

We have presented , a new, robust, and efficient method to detect gene families with compatible evolutionary histories, which may predict good candidates to be used in phylogenomic tree reconstructions. The distance regression basis of our proposed method does not require hypotheses regarding evolutionary relationships between taxa represented by the branching pattern of phylogenetic trees. Besides significant gains in accuracy and computing efficiency compared with other tree-based approaches, introduces a new and robust strategy to pair copies of gene families that best represent shared evolutionary trends. The strong effect of genetic linkage in pairwise estimates within archaeal gene families constitute independent evidence of ’s ability to recover shared evolutionary histories within empirical data sets. Despite similar performances of Pearson’s r and wODR in detecting these trends, achieves the same result in a more efficient way. The utilization of wODR also imparts more robust statistical support not directly available to previous Pearson’s r implementations, whereas the assessment of pairwise distances between taxa provides robustness in the presence of artifacts associated with phylogenetic inference (e.g., LBA). The ability to assess residuals of each data point independently also allows for evaluations of specific homologs, a useful tool for HGT detection. can thus be incorporated into phylogenomics pipelines and used to guide the selection of gene families for more accurate and robust species-tree inference, as well as the detection of meaningful clusters of gene families evolved in shared, yet reticulate, patterns. Results obtained from all three simulated scenarios and their replicates corroborate the efficiency of under multiple conditions, which further supports its application to assess distinct data sets. By assessing similarities of evolutionary signal between archaeal gene families using we were able to detect several clusters of shared distinct evolutionary trends. Phylogenetic reconstruction using concatenated sequences from each of the four major CES clusters of core genes provides strong evidence for the existence of four major evolutionary trends. The phylogeny resulting from CES cluster#4, in particular, recovers a species tree hypothesis consistent with that proposed in several other studies, and does so while using a more empirically supported selection of gene families that does not presuppose vertical inheritance. CES clusters obtained using also provide key evidence for horizontal exchange of functionally related genes between phyla (supplementary fig. S10, Supplementary Material online). For example, given the almost exclusive occurrence of genes from CES cluster#1 among methanogenic Euryarchaeota and Thaumarchaeota, tree-based approaches are not able to report the potential HGT between these phyla. Separately, intra- and interphyla distances obtained from CES cluster#1 are strongly correlated to distances described in CES cluster#4, however the significant placement of interphyla distances bellow the wODR line strongly suggests an HGT between ancestors of both phyla. The method used to analyze the archaeal gene set is general and can thus be applied to other genome sets. Furthermore, the implementation described provides a straightforward framework for future improvements and a possible alternative to phylogenetic reconciliation approaches to identify HGT events, as described in supplementary figures S9–S15, Supplementary Material online.

Materials and Methods

ODR is an errors-in-variables regression method that accounts for measurement errors in both explanatory and response variables (Boggs et al. 1987), instead of attributing all errors exclusively to the response variable, as performed by OLS. Although OLS regression models seek to minimize the sum of squared residuals of the response variable, ODR minimizes the sum of squared residuals from each data point obtained by the combination of explanatory and response variables. Novichkov et al. (2004) assessed the compatibility between the evolutionary history of genes with a reference genomic evolutionary history using Pearson’s r and estimates of an OLS regression’s intercept. The latter extra step, when compared with other implementations using solely Pearson’s r (Ramani and Marcotte 2003; Izarzugaza et al. 2008; Gueudré et al. 2016) is required to infer if data points not fitting a regression line through zero are caused by HGT. The approach proposed by Novichkov et al. requires two key conditions that restrict its usage on empirical data sets: 1) there must exist a reference history to which gene histories are compared; and 2) there are no errors in the reference distances between genomes. The approach described here is based on ODR. Its modeling of errors within both assessed variables decreases the necessity of comparing gene family pairwise distances against a well-established reference. When evaluating evolutionary histories of two gene families with no clear separation between explanatory and response errors-in-variables approaches (e.g., ODR) are better suited to compare pairwise evolutionary distances. Our implementation uses a wODR model with regression line through the origin (, where is the Y-axis intercept) to avoid overfitting ODR model to the detriment of coherent evolutionary assumptions. Data points are independently weighed based on residuals from an initial regression line to decrease the model’s susceptibility to few homologs with strong signal incompatibility.

Algorithm Explanation

In the simplest scenario of two gene families occurring as single copies in the same set of genomes, is equal to the coefficient of determination () of a wODR between pairwise distance matrices of both gene families. Data points are weighted in regard to their impact on the model fitting. The weighing step is required to avoid a few outlying sequences preventing the identification of a signal shared by the majority; weights are estimated as the inverse of residuals obtained from a geometric mean regression with intercept equal to zero and slope equal to , where and are the standard deviations of the regressed variables. If two assessed gene families do not occur in the same set of genomes, the wODR is calculated exclusively using homologs from genomes containing both families; the resulting is then adjusted by the Bray–Curtis Index (). is defined as , where is the Bray–Curtis Dissimilarity (Bray and Curtis 1957) calculated from the copy number of each gene family within genomes. The incorporation of unequal genomic occurrence between gene families prevents possible overestimation of evolutionary signal similarity by the wODR caused by gene losses and duplications that are not observed by the regression. Supplementary figure S1, Supplementary Material online, depicts how the decrease in taxa overlap can lead to overestimated shared evolution solely by wODR , and consequently the importance of adjustment. We simulated two gene families separated by five Subtree Prune and Regraft (SPR) transformations and measured the evolutionary similarity between both gene families as we randomly removed one taxon from each simulated gene family (supplementary fig. S1, Supplementary Material online). As the set of taxa used during the regression becomes unrepresentative of underlying evolutionary processes, estimates based only on wODR tend to artificially increase. We will subsequently refer to the wODR product as . A main advantage of our proposed method over tree-based approaches is its ability to quantify the evolutionary signal shared by gene families with different copy numbers within genomes, as depicted in figure 7. When estimating between one gene family occurring exclusively as single-copy (gene1) and another gene family (gene2) with two copies within genome J (j1 and j2), we initially select which of J’s copy of gene2 (j1 or j2) maximizes between both gene families. During an initial wODR step, gene1’s pairwise distances involving its single J copy are duplicated in such a way that distances involving both j1and j2 are initially paired with it (fig. 7); the gene2 copy in genome J yielding the least sum of squared residuals will be kept during further steps. The final pairwise will be estimated using the copy of gene2 that results in the least sum of residuals when paired with the single gene1 copy (j1 in fig. 7). Whenever both gene families occur in multiples within the same genome, all nonoverlapping products of copies from both families are part of the final estimation. In our implementation, wODR is performed through the SciPy (Virtanen et al. 2020) API of ODRPACK (Boggs et al. 1989). Our method’s capability to automatically select gene copies that optimize evolutionary signal similarity between two gene families vastly expands the scope of data sets fit for general evaluation of evolutionary signal compatibilities. The presence of multiple gene copies within a genome constitutes a key bottleneck to methods commonly used to assess the similarity of evolutionary histories. Tree-based evolutionary distance assessment algorithms are not generally capable of pairing genes between two gene families when at least one family contains multiple gene copies within genomes (Stamatakis 2006; Nguyen et al. 2015; Gori et al. 2016; Huerta-Cepas et al. 2016); Pearson r implementations either rely on multiple tests (Gertz et al. 2003; Ramani and Marcotte 2003; Izarzugaza et al. 2008) or on predicting structural interaction between gene products (Gueudré et al. 2016).

Steps for estimation between gene families containing multiple gene copies. tree1 and tree2 are phylogenetic trees of two hypothetical gene families, gene1 and gene2, respectively. matrix1 and matrix2 contain pairwise evolutionary distances between taxa from their respective gene families. The red arrows in matrix1 highlight the duplication of pairwise distances involving the j homolog of gene1 necessary to match dimensions of the two matrices. The wODR scatterplot displays the linear relationships between distances from both gene families, and highlights distances related to the j1 homolog of gene2 in blue and related to the j2 homolog in red. Arrows also highlight pairwise distances homologs in genomes J and I from both gene families.

Statistics and Data Analysis

Pandas Python library (McKinney 2010) was used to perform operations on pairwise distance matrices and for generating condensed versions of the matrices submitted to the wODR model. Effect size (f) hypothesis tests of differences between distributions were obtained using Common Language statistics (McGraw and Wong 1992), and P value correction for multiple tests was performed using False Discovery Rate implementation in StatsModels Python library (Seabold and Perktold 2010). Network community detection was performed using the Louvain clustering method (Blondel et al. 2008) implementation available in the iGraph (Csardi and Nepusz 2006) Python library (igraph.community_multilevel). Enrichment of gene families within sets of genomes were assessed using hypergeometric tests and P values corrected with Benjamini–Hochberg’s false discovery rate and expressed as q values (Benjamini and Hochberg 1995; Benjamini and Yekutieli 2001).

Data Simulation

Our simulated data set is composed of three sets of 50 trees generated through random stepwise perturbations from a single starting tree. Each set of 50 trees differs from the other on the intensity of the stepwise perturbations, which were simulated through 49 consecutive random SPR transformations with varying regrafting ranges. Small perturbations were caused by regrafting pruned subtrees to a branch within 10% of the branches closest to the original placement; medium perturbations regrafted within the 50% closest branches; and strong perturbations regrafted the pruned subtree to any branch in the tree. Additionally, each bipartition’s branch length was multiplied by independently drawn gamma distributed random variables ( and ) after each SPR. These three sets of simulated trees will be referred to as SPR10, SPR50, and SPR100 and were independently replicated ten times. In a real-world scenario, multiple complex mechanisms can shuffle evolutionary signals without altering gene copy numbers (e.g., hidden paralogy, xenologous gene displacement, and incomplete lineage sorting); whereas these mechanisms tend to cause varying levels of perturbation to the underlying evolutionary signal the topological consequence to the tree is the same for all, an SPR. Through consecutive perturbations of the initial tree in the form of random SPR and branch length transformations, we obtained three sets of simulated gene families with histories of greatly varying similarities. All trees were simulated with in-house scripts using ETE3 (Huerta-Cepas et al. 2016). All simulated trees are available in Supplementary Material online. Simulated gene family phylogenies were used to generate sequence alignments containing insertions and deletions using INDELible (Fletcher and Yang 2009) (see Supplementary Material online), which outputs perfectly aligned homologous sites of simulated sequences. Sequences from each resulting simulated gene family were also aligned using MAFFT (Katoh and Standley 2013) with the –auto parameter; both the true alignment provided by INDELible and the empirical alignment generated by MAFFT were used to construct phylogenetic trees and pairwise distance matrices using IQTree (Nguyen et al. 2015). Differences between aligned homologous sites simulated by INDELible and the sequence alignment obtained using MAFFT were assessed using Sum-of-Pairs score (SP-score) calculated by FastSP (Mirarab and Warnow 2011).

Archaeal Empirical Data Set

Complete genome sequences of 42 Archaea from the Euryarchaeota phylum and from TACK, DPANN, and Asgardarchaeota groups were downloaded from NCBI GenBank (Supplementary table S1, Supplementary Material online). Other candidate phyla known from metagenomic sequences as well as some remaining members of the DPANN group were not included, as their expected phylogenetic relationships are not as well understood. Archaea was selected as the test data set since the evolutionary relationships between some major groups are well-established, whereas others remain contested. Furthermore, many sets of archaeal metabolic genes have a strong phyletic dependence (e.g., methanogenesis among Euryarchaeota; Borrel et al. 2013), therefore facilitating the assessment of shared evolutionary trends driven by similar ecological and/or metabolic requirements. Clustering of homologous proteins was performed using the orthoMCL (Li et al. 2003) implementation available in the GET_HOMOLOGUES package (Contreras-Moreira and Vinuesa 2013). Evolutionary similarity comparisons were restricted to gene families present in at least ten genomes. Pairwise maximum likelihood distances between homologous proteins were generated using IQTree under the LG+G evolutionary model. Phylogenetic trees from clusters of gene families with CES and extended core genome (i.e., single copy and present in at least 35 out of the 42 sampled Archaea genomes) were reconstructed from concatenated multiple sequence alignments using the LG+C60+F+G and individual partitions corresponding to each concatenated gene. Enrichment of gene functions among CES clusters were performed using StringDB API (Szklarczyk et al. 2019). For each genome, homologs from CES gene families were submitted independently for enrichment assessment. Retrieved protein annotations are available in the Supplementary Material online.

Tree-Based Metrics of Evolutionary Similarity

All four tree-based metrics were used to calculate distances between all pairwise combinations of trees reconstructed from simulated alignments. was calculated using the treeCl Python package (Gori et al. 2016), RF was obtained using ETE3, and both and were calculated using treeCmp (Bogdanowicz et al. 2012). Given the nonadditive nature of tree-based metrics and uniform probability of simulated SPR moves across all branches, estimates from tree-based methods showed substantial degrees of saturation when compared with the number of perturbations between simulated gene families. In comparisons presented below we used a transformation that applies, if the approach to the steady state follows an exponential decay: , where and are the adjusted and normalized distance estimates. For sequence divergence, this is known as Poisson Correction (Nei and Zhang 2006).

Supplementary Material

Supplementary data are available at Genome Biology and Evolution online. Click here for additional data file.

59 in total

An Efficient, Nonphylogenetic Method for Detecting Genes Sharing Evolutionary Signals in Phylogenomic Data Sets.

Significance

Introduction

Results and Discussion

Simulated Data Set

Robustness Assessment between Approaches

Evolutionary Similarities within Archaeal Gene Families

Clusters of Gene Families with Compatible Evolutionary Signals

Compatible Evolutionary Signal Clusters and Possible Vertical Evolutionary Signals

Common and Distinct Evolutionary Trends between CES Clusters

Consistency of Duplicated Gene Copies within CES Clusters

Conclusions

Materials and Methods

Algorithm Explanation

Statistics and Data Analysis

Data Simulation

Archaeal Empirical Data Set

Tree-Based Metrics of Evolutionary Similarity

Supplementary Material

1. Exploiting the co-evolution of interacting proteins to discover interaction specificity.

2. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles.

3. FastSP: linear time calculation of alignment accuracy.

4. A fast algorithm for computing geodesic distances in tree space.

Review 5. Mechanisms of, and barriers to, horizontal gene transfer between bacteria.

6. Biased gene transfer mimics patterns created through shared ancestry.

7. MAFFT multiple sequence alignment software version 7: improvements in performance and usability.

8. Search for a 'Tree of Life' in the thicket of the phylogenetic forest.

9. The tree of one percent.

10. INDELible: a flexible simulator of biological sequence evolution.