Fan Song1, Hu Li1, Pei Jiang1, Xuguo Zhou2, Jinpeng Liu3, Changhai Sun4, Alfried P Vogler5, Wanzhi Cai6. 1. Department of Entomology, China Agricultural University, Beijing, China. 2. Department of Entomology, University of Kentucky, Lexington. 3. Markey Cancer Center, University of Kentucky, Lexington. 4. Department of Entomology, Nanjing Agricultural University, Nanjing, China. 5. Department of Life Sciences, Silwood Park Campus, Imperial College London, Ascot, United Kingdom Department of Life Sciences, Natural History Museum, London, United Kingdom caiwz@cau.edu.cn a.vogler@nhm.ac.uk. 6. Department of Entomology, China Agricultural University, Beijing, China caiwz@cau.edu.cn a.vogler@nhm.ac.uk.
Abstract
After decades of debate, a mostly satisfactory resolution of relationships among the 11 recognized holometabolan orders of insects has been reached based on nuclear genes, resolving one of the most substantial branches of the tree-of-life, but the relationships are still not well established with mitochondrial genome data. The main reasons have been the absence of sufficient data in several orders and lack of appropriate phylogenetic methods that avoid the systematic errors from compositional and mutational biases in insect mitochondrial genomes. In this study, we assembled the richest taxon sampling of Holometabola to date (199 species in 11 orders), and analyzed both nucleotide and amino acid data sets using several methods. We find the standard Bayesian inference and maximum-likelihood analyses were strongly affected by systematic biases, but the site-heterogeneous mixture model implemented in PhyloBayes avoided the false grouping of unrelated taxa exhibiting similar base composition and accelerated evolutionary rate. The inclusion of rRNA genes and removal of fast-evolving sites with the observed variability sorting method for identifying sites deviating from the mean rates improved the phylogenetic inferences under a site-heterogeneous model, correctly recovering most deep branches of the Holometabola phylogeny. We suggest that the use of mitochondrial genome data for resolving deep phylogenetic relationships requires an assessment of the potential impact of substitutional saturation and compositional biases through data deletion strategies and by using site-heterogeneous mixture models. Our study suggests a practical approach for how to use densely sampled mitochondrial genome data in phylogenetic analyses.
After decades of debate, a mostly satisfactory resolution of relationships among the 11 recognized holometabolan orders of insects has been reached based on nuclear genes, resolving one of the most substantial branches of the tree-of-life, but the relationships are still not well established with mitochondrial genome data. The main reasons have been the absence of sufficient data in several orders and lack of appropriate phylogenetic methods that avoid the systematic errors from compositional and mutational biases in insect mitochondrial genomes. In this study, we assembled the richest taxon sampling of Holometabola to date (199 species in 11 orders), and analyzed both nucleotide and amino acid data sets using several methods. We find the standard Bayesian inference and maximum-likelihood analyses were strongly affected by systematic biases, but the site-heterogeneous mixture model implemented in PhyloBayes avoided the false grouping of unrelated taxa exhibiting similar base composition and accelerated evolutionary rate. The inclusion of rRNA genes and removal of fast-evolving sites with the observed variability sorting method for identifying sites deviating from the mean rates improved the phylogenetic inferences under a site-heterogeneous model, correctly recovering most deep branches of the Holometabola phylogeny. We suggest that the use of mitochondrial genome data for resolving deep phylogenetic relationships requires an assessment of the potential impact of substitutional saturation and compositional biases through data deletion strategies and by using site-heterogeneous mixture models. Our study suggests a practical approach for how to use densely sampled mitochondrial genome data in phylogenetic analyses.
Mitochondrial genomes have been widely applied to infer intraordinal relationships across insects, which in most instances were found to be congruent with other sources of data and provided convincing levels of support (Cameron et al. 2007; Timmermans et al. 2010, 2015; Wiegmann et al. 2011; Li et al. 2012, 2013; Cameron 2014; Gillett et al. 2014; Wu et al. 2014; Bourguignon et al. 2015). However, for deep relationships mitochondrial genomes are considered ineffective, for example, for the study of interordinal and intraordinal relationships of Insecta and Arthropoda, due to base compositional heterogeneity and variable evolutionary rate that potentially mislead tree inference and frequently result in strong topological conflicts with morphological and nuclear data and high inconsistency among different inference methods (Nardi et al. 2003; Cameron et al. 2004; Sheffield et al. 2009; Talavera and Vila 2011; Simon and Hadrys 2013). These compositional and mutational biases in mitochondrial genomes are in violation of the most widely used substitution models, which accommodate among-site rate variation by applying a GTR (general-time-reversible) model with uniform gamma distribution to all characters and thus accommodate differences in rates (fast or slow sites), but neglect variation in other parameters that may differ among characters (Kolaczkowski and Thornton 2004; Lartillot and Philippe 2004). Accelerated evolutionary rates can lead to systematic errors in the inference of relationships due to long-branch attraction (LBA) (Siddall and Whiting 1999; Brinkmann et al. 2005; Yang and Rannala 2012). In addition, evolutionary models that do not account for compositional heterogeneity can lead to false groupings of unrelated taxa with similar base composition (Tarrío et al. 2001; Sheffield et al. 2009). Recent evolutionary models such as the site-heterogeneous mixture model establish a number of rate categories that each are defined by different equilibrium frequencies of nucleotide or amino acid characters estimated from the empirical data (Lartillot and Phillippe 2004). This approach permits exchangeabilities that differ over sites, and thereby relax the assumption of homogeneity across sites of the standard models. These models apparently reduce the negative effects of compositional and mutational bias (Lartillot et al. 2009; Husník et al. 2011; Morgan et al. 2013; Li et al. 2015; Timmermans et al. 2015).The use of the most accurate models will reduce the probability of systematic errors, as they are able to extract the genuine phylogenetic signals (Delsuc et al. 2005; Jeffroy et al. 2006; Philippe et al. 2011; Morgan et al. 2013; Liu et al. 2014). However, improved models might not hold all the answers to the inconsistency problem, especially when the genuine phylogenetic signal is weak (e.g., for ancient phylogenetic relationships) or the nonphylogenetic signal is predominant (Delsuc et al. 2005; Jeffroy et al. 2006; Philippe et al. 2011). Fast-evolving sites are particularly challenging for phylogenetic reconstruction because they are likely to have experienced multiple substitutions, eroding the genuine signal and misleading phylogenetic inference (Brinkmann and Philippe 1999; Pisani 2004; Delsuc et al. 2005; Goremykin et al. 2013). Various strategies have been suggested to reduce the impact of nonphylogenetic signal, for example, 1) sampling more species to correctly infer multiple substitutions at a site (Baurain et al. 2007; Dunn et al. 2008; Pick et al. 2010; Philippe et al. 2011); 2) excluding the third codon position of protein-coding genes or using RY-coding to decrease saturation and compositional bias (Delsuc et al. 2003; Phillips et al. 2004; Breinholt and Kawahara 2013); and 3) identifying and removing the fast-evolving sites, for example, the slow-fast (SF) method (Kostka et al. 2008) and the observed variability (OV) sorting method (Zhong et al. 2011; Goremykin et al. 2013; Liu et al. 2014).The debate over the utility of mitochondrial genomes in deep relationships of insects is ongoing (Cameron et al. 2004; Talavera and Vila 2011; Simon and Hadrys 2013). However, most previous attempts to reconstruct interordinal relationships using mitochondrial genomes have used site-homogeneous models that do not account for variation in composition or exchange rates across the data (Cameron et al. 2004; Simon and Hadrys 2013). Bayesian site-heterogeneous mixture models have been employed in a few studies, with apparently good success (Talavera and Vila 2011), but it is not clear if they can compensate for the effect of compositional heterogeneity and saturation among sites on phylogenetic inference of deep relationships of insect mitochondrial genomes.Holometabola are the most species-rich lineage of animals on Earth (Kristensen 1999; Trautwein et al. 2012), containing approximately 780,000 described species. Most of the species richness is concentrated in four super-radiations: Coleoptera (beetles), Hymenoptera (bees, ants, and wasps), Lepidoptera (moths and butterflies), and Diptera (true flies) (Grimaldi and Engel 2005), which combined represent about 50% of all known metazoan species (Wilson 1992). A credible resolution of relationships among the 11 recognized holometabolan orders is now in sight, as integrative systematics and independent data types corroborate an increasingly well-supported tree topology (see discussion in Trautwein et al. 2012). Progress has come especially from several recent phylogenetic studies based on multiple nuclear protein-coding genes (Wiegmann et al. 2009; Ishiwata et al. 2010) and large numbers of single-copy orthologs from expressed sequence tags (ESTs) (Meusemann et al. 2010; Letsch et al. 2012), transcriptomes (Misof et al. 2014; Peters et al. 2014), and whole genomes (Niehuis et al. 2012). According to the molecular evidence, and largely consistent with traditional morphological hypotheses (Kristensen 1999; Beutel and Pohl 2006; Wiegmann et al. 2009; Ishiwata et al. 2010; Misof et al. 2014; Peters et al. 2014), Holometabola is divided into three main lineages: Hymenoptera, Neuropteroidea (Neuropterida, Coleoptera, Strepsiptera), and Mecopterida (Lepidoptera, Trichoptera, Diptera, Mecoptera, Siphonaptera) (fig. 1). The basal split of Hymenoptera from all others has been confirmed by most molecular studies (Wiegmann et al. 2009; Ishiwata et al. 2010; Meusemann et al. 2010; Peters et al. 2014; Misof et al. 2014) and recently by morphological analyses (Beutel et al. 2011). The longstanding controversial position of Strepsiptera has also been resolved in favor of a close relationship with Coleoptera by both nuclear genes (Wiegmann et al. 2009; Ishiwata et al. 2010; Longhorn et al. 2010; Niehuis et al. 2012; Misof et al. 2014; Peters et al. 2014) and morphological data (Beutel et al. 2011).
F
Current view of higher level relationships of Holometabola. This tree represents the best recent estimate of holometabolan insect relationships based on nuclear genes (Wiegmann et al. 2009; Misof et al. 2014; Peters et al. 2014). Eight nodes were selected to assess the quality of trees under the different methodological strategies. These uncontroversial relationships are labeled by orange circles with number: 1, the basal split of Hymenoptera from all others; 2, Neuropteroidea + Mecopterida; 3, Neuropteroidea; 4, Coleopterida; 5, Neuropterida; 6, Mecopterida; 7, Antliophora; 8, Amphiesmenoptera.
Current view of higher level relationships of Holometabola. This tree represents the best recent estimate of holometabolan insect relationships based on nuclear genes (Wiegmann et al. 2009; Misof et al. 2014; Peters et al. 2014). Eight nodes were selected to assess the quality of trees under the different methodological strategies. These uncontroversial relationships are labeled by orange circles with number: 1, the basal split of Hymenoptera from all others; 2, Neuropteroidea + Mecopterida; 3, Neuropteroidea; 4, Coleopterida; 5, Neuropterida; 6, Mecopterida; 7, Antliophora; 8, Amphiesmenoptera.Unlike the recent successes of resolving holometabolan relationships with nuclear genes, interordinal studies with mitochondrial genomes are limited, and most of them omit too many orders to be of much comparative value and recovered very few of the widely accepted clades (Cameron et al. 2009; Wei et al. 2010; Talavera and Vila 2011; Kaltenpoth 2012; Wang et al. 2014). Furthermore, mitochondrial genomes of holometabolan insects showed significant lineage-specific variation in base composition and mutational rate, creating phylogenetic error in superimposed substitutions, but these confounding effects may be more manageable as taxon sampling increases. Thus, holometabolan insects constitute an excellent model to evaluate the impact of systematic errors in mitochondrial phylogenomics and to test challenging questions of phylogenetic methodology.To investigate the potential and pitfalls of the mitochondrial phylogenomic data under dense taxon sampling, we conduct a phylogenetic analysis of mitochondrial genome sequences of 199 representative taxa from all 11 holometabolan orders. Eight uncontroversial relationships of deep holometabolan relationships are selected as indicators to test the performance of different phylogenetic methodological strategies (fig. 1). We find that high compositional heterogeneity and saturation of the sequences can lead to strongly supported but incorrect phylogenies under standard “site-homogeneous” models, especially when the phylogenetic signal for a given branch is significantly weaker than the nonphylogenetic signal. We show that nonphylogenetic signal can be reduced by using 1) the site-heterogeneous mixture model, 2) the inclusion of rRNA (ribosomal RNA) genes, and 3) removal of fast-evolving sites. To address the difficult question of the phylogenetic relationships among major insects, and specifically in the Holometabola, all these strategies are required at the same time.
Materials and Methods
Taxon Sampling and Sequence Alignment
We downloaded nucleotide sequences of the 13 protein-coding genes and 2 rRNA genes for 197 Holometabola insects from the NCBI (National Center of Biotechnology Information). To this initial data set, we added newly generated mitochondrial genomes for two Trichoptera species (GenBank accession numbers KF717094 and KF717095) and added 4 Hemiptera species as outgroups, thus generating a data set of 203 taxa (199 Holometabola species and 4 outgroups; see supplementary table S1, Supplementary Material online). The annotated mitochondrial genomes of the two newly sequenced caddisflies were presented in supplementary figure S1, Supplementary Material online.Each protein-coding gene was aligned individually based on codon-based multiple alignments by using the MAFFT algorithm implemented in TranslatorX with the L-INS-i strategy (Abascal et al. 2010). Two rRNA genes were individually aligned using the MAFFT 7.0 online server with the G-INS-i strategy (Katoh and Standley 2013).
Base Composition and Substitution Rate of Protein-Coding Genes
For each species of Holometabola, we concatenated the 13 mitochondrial protein-coding genes in MEGA 6.0 (Tamura et al. 2013) and calculated 1) the overall nucleotide GC% (Guanine-Cytosine) using all 3 codon positions and 2) the frequency of amino acids encoded by GC-rich codons (glycine, alanine, arginine, and proline [GARP]). The nonsynonymous substitution rate (K) was calculated using the DnaSP v.5.0 (Librado and Rozas 2009). We also extracted branch length estimates from the most likely tree after standard Bayesian inference (BI) and maximum-likelihood (ML) analysis.
Phylogenetic Analyses Using Site-Homogeneous Models
Alignments of individual genes were concatenated to generate five 203-taxa data sets: 1) the PCG matrix, including all three codon positions of protein-coding genes (total of 11,169 bp); 2) the PCG12 matrix, including the first and second codon positions of protein-coding genes (total of 7,446 bp); 3) the PCGR matrix, including all three codon positions of protein-coding genes and two rRNA genes (total of 13,799 bp); 4) the PCG12R matrix, including the first and second codon positions of protein-coding genes and two rRNA genes (total of 10,076 bp); and 5) the AA matrix, including amino acid sequences of protein-coding genes (total of 3,723 amino acids). The heterogeneity of sequence divergence within data sets was analyzed using AliGROOVE (Kück et al. 2014) with the default sliding window size. Indels in nucleotide data set were treated as ambiguity and a BLOSUM62 matrix was used as default amino acid substitution matrix. The metric establishes pairwise sequence comparisons of individual terminals or subclades with terminals outside of the focal group. The obtained scoring distance between sequences is then compared with similarity over the entire data matrix. Values can vary between −1 if comparisons are very different from the remainder of the matrix to +1 for comparisons whose score match the average for the entire matrix. This provides an indirect measure of heterogeneity of a given sequence or clade with respect to the full data set.We first analyzed five datasets under both BI and ML frameworks using MrBayes 3.2.3 (Ronquist et al. 2012), PhyloBayes MPI 1.4f (Lartillot et al. 2013), and RAxML-HPC2 8.1.11 (Stamatakis 2006), respectively. Separate partitions were created for each gene in the data set. Bootstrap ML analyses with 1,000 replicates were performed with the fast ML method implemented in RAxML using the GTRGAMMA model for nucleotide data and the MtArt (Abascal et al. 2007) model for amino acid data. For MrBayes analyses, the best-fit model of nucleotide sequences of each gene was determined with jModelTest 0.1.1 (Posada 2008) according to the Akaike Information Criterion (AIC) (supplementary table S2, Supplementary Material online). Two simultaneous runs of 6–20 million generations were conducted for the matrix and trees were sampled every 1,000 generations, with the first 25% discarded as burn-in. Stationarity was considered to be reached when the average standard deviation of split frequencies was below 0.01 (Huelsenbeck et al. 2001). The amino acid data set was analyzed using PhyloBayes MPI with the MtArt model. Two independent chains starting from a random tree were run for 30,000 cycles, with trees being sampled at every cycle. The initial 7,500 trees of each Markov chain Monte Carlo (MCMC) run were discarded as burn-in. A consensus tree was computed from the remaining 45,000 trees combined from two runs.Three data sets (PCG, PCGR and AA) were also used to test the different partitioning schemes for ML and BI methods. The optimal partitioning scheme and substitution model was selected by PartitionFinder v1.1.1 (Lanfear et al. 2012). We created input configuration files that contained different predefined partitions for each data set: 1) 13 gene partitions for PCG; 2) 39 codon partitions for PCG; 3) 15 gene partitions for PCGR; 4) 41 partitions for PCGR (39 codon positions for PCGs and 2 partitions for rRNAs); and 5) 13 gene partitions for AA. We used the “greedy” algorithm with branch length estimated as “unlinked” and AIC criteria to search for the best-fit partitioning scheme and substitution model. The best selected partitioning schemes and models of three data sets for ML analyses were listed in supplementary table S3, Supplementary Material online. Partitioned ML analyses were conducted using RAxML.
Phylogenetic Analyses based on Site-Heterogeneous Models
Ten-fold Bayesian cross-validation was performed to test the fit of the site-heterogeneous mixture models (CAT and CAT + GTR) and “site-homogeneous” models (GTR and MtArt) to our complete nucleotide (PCG and PCGR) and amino acid (AA) data sets using PhyloBayes 3.3f (Lartillot et al. 2009) (see PhyloBayes manual). The result of cross-validation showed that the CAT + GTR model was the best fitting model for all data sets (supplementary table S4, Supplementary Material online).We then inferred phylogenies from the AA, PCG, PCG12, PCG-RY (protein-coding genes with third codon positions R-Y coded), PCGR, PCG12R, and PCGR-RY (the third codon position of protein-coding genes R-Y coded) data sets using PhyloBayes MPI 1.4f (Lartillot et al. 2013), with the CAT + GTR model. In each individual analysis, two independent chains starting from a random tree were run for 30,000 cycles, with trees being sampled at every cycle. The initial 7,500 trees of each MCMC run were discarded as burn-in. A consensus tree was computed from the remaining 45,000 trees combined from 2 runs. All phylogenetic analyses were carried out on the CIPRES Science Gateway (Miller et al. 2010) and the High Performance Computing Cluster at the University of Kentucky Analytics and Technologies.
Model-based Saturation Plots and Posterior Predictive Analyses of Sequence Homoplasy
For the saturation plots, the overall best fitting CAT + GTR model was selected as a reference model. Patristic distances derived from trees obtained under site-homogeneous models or using the observed distances (uncorrected P-distances) were plotted against the CAT + GTR distances, and the slope of the regression line in the plot was used as a measure of saturation (relative to the divergences estimated from the rate-heterogeneous model). Patristic distances were generated using PATRISTIC (Fourment and Gibbs 2006). Posterior predictive analysis implemented in PhyloBayes 3.3f (Lartillot et al. 2009) was also used to compare the ability of alternative models to estimate the homoplasy in our data sets.
Exploration of the Signal in the Nucleotide and Amino Acid Data Sets
To explore the phylogenetic signal in the nucleotide and amino acid data sets, we excluded the fast-evolving sites using SlowFaster (Kostka et al. 2008). To assign substitution rates to individual positions, eight widely recognized groups (Hymenoptera, Megaloptera, Neuroptera, Strepsiptera, Coleoptera, Trichoptera, Lepidoptera, and Diptera) were chosen. Positions with the highest rates were gradually excluded and new restricted sub-data sets were produced. The nucleotide and amino acid sub-data sets were analyzed with PhyloBayes under the CAT + GTR model.
Phylogenetic Error Reduction by OV-Sorting Method
By using OV sorting (Goremykin et al. 2010, 2013), the full data set was ordered from the most variable sites to the most conserved sites, and a series of alignments was generated by successively shortening the OV-sorted alignment in steps of 500 sites. At each round of data removal, two data partitions were obtained: 1) an “A” partition, which includes sites from the conserved end of the alignment that are left after the iterative removal of the fastest sites, and 2) a “B” partition, which includes the variable sites removed in the current round. After model fitting was applied to each partition using ModelTest (Posada and Crandall 1998), the ML distance and uncorrected p distance were calculated using PAUP* (Swofford 2002). Correlation analyses were conducted at each shortening step to establish 1) the correlation of the ML and uncorrected p distances for partition B, testing for the change in this correlation as uncorrected p distances increasingly fail to capture the true divergence in the most variable data, unlike the model-based estimates; and 2) the correlation of the ML distances on pairs of taxa for partition A and B, in the expectation that both partitions produce roughly similar distances, unless the two partitions differ in their rates, in which case the B partition should be removed. The stopping point for site removal was determined as the point at which the two correlations showed a significant improvement (also named the GNB criterion in Goremykin et al. 2013). To reduce computation time for PAUP tree-building analyses on a 24-core Linux server, a small taxon selection (36-taxa) from the full 203-taxa PCGR data set was used in the OV-sorting analyses. The 36-taxa data set included species from all 11 holometabolan insect orders, and we fully considered the diversity of their branch length and A + T content to simulate the complicacy of the full 203-taxa data set. Finally, the OV-sorted data set selected by the GNB criterion was analyzed with PhyloBayes under the CAT + GTR model.
Results and Discussion
High Degree of Compositional Heterogeneity
We explored the compositional diversity of both nucleotides and amino acids of mitochondrial protein-coding genes across holometabolan orders (fig. 2). Sequences of Hymenoptera and Strepsiptera were more extremely A + T rich and low in the GC-encoding GARP amino acids than other orders. Among the three orders of Neuropterida, sequences of Raphidioptera were more A + T rich than Neuroptera and Megaloptera. In Mecopterida, sequences of Lepidoptera, Trichoptera, and Siphonaptera were more A + T rich than Diptera and Mecoptera. Five orders (Hymenoptera, Strepsiptera, Diptera, Mecoptera, and Coleoptera) showed high compositional bias at the intraordinal level; for example, sequences of two gall midges (Diptera: Cecidomyiidae) were more A + T rich than other species of Diptera (supplementary table S5, Supplementary Material online). Our observation showed a high degree of compositional heterogeneity among holometabolan mitochondrial genomes in both nucleotide and amino acid level and such variability is known as the source of systematic errors in phylogenetic reconstructions (Sheffield et al. 2009; Rota-Stabelli et al. 2010).
F
Compositional properties of holometabolan mitochondrial protein-coding genes. The G + C content of the concatenated alignment is plotted against the percentage of amino acids encoded by G- and C-rich codons (GARP). Values are averaged for orders, with standard deviations indicated.
Compositional properties of holometabolan mitochondrial protein-coding genes. The G + C content of the concatenated alignment is plotted against the percentage of amino acids encoded by G- and C-rich codons (GARP). Values are averaged for orders, with standard deviations indicated.
Contrasting Rates of Evolution in the Mitochondrial Genome of Holometabola
We measured K for each taxon included in our study in comparison with the outgroup (supplementary table S5, Supplementary Material online). These comparisons showed that K is low for Mecoptera (0.24–0.27), Neuroptera (0.25–0.26), Megaloptera (0.25–0.27), Siphonaptera (0.28), Trichoptera (0.29 and 0.30), Raphidioptera (0.29), Coleoptera (0.24–0.33), Diptera (0.23–0.32), and Lepidoptera (0.23–0.31), and generally higher for Hymenoptera (0.30–0.49) and Strepsiptera (0.37 and 0.45). In Hymenoptera, the K of a sawfly Monocellicampa pruni (Wei et al. 2015) (Tenthredinidae) (0.30) was also markedly lower than the other species. Comparison of branch lengths in the phylogenetic tree showed a similar trend, and a positive correlation was observed between K and branch length (R2 = 0.88) (supplementary table S5, Supplementary Material online). Overall, these findings demonstrate the lineage-specific substitution rate and contrasting rate of mitochondrial genome evolution both across and within Holometabola orders, especially a significantly accelerated evolutionary rate in Hymenoptera and Strepsiptera. Accelerated substitution rates may play a role in masking and eroding phylogenetic signal through unrecognized homoplasy and lead to increased susceptibility to systematic bias, such as LBA (Bergsten 2005; Rota-Stabelli et al. 2010; Talavera and Vila 2011).
Heterogeneous Sequence Divergence within Holometabola Mitochondrial Genomes
The AliGROOVE procedure provides a measure of heterogeneity of sequence divergence by conducting pairwise comparisons of nucleotide divergences for each terminal or set of terminals defined by an internal node against all other sequences in a multiple sequence alignment (Kück et al. 2014). This analysis found strong heterogeneity in sequence divergence for a subset of taxa (fig. 3 and supplementary fig. S2, Supplementary Material online). In particular, pairwise sequence comparisons involving Hymenoptera (excluding Tenthredinidae), Strepsiptera, and two gall midges (Diptera: Cecidomyiidae) received lower similarity scores than pairwise comparisons between other sequences. The divergence of these taxa may indicate that species in these groups cannot be robustly placed in the phylogenetic tree or may be misplaced. The heterogeneity was strongest for data sets PCG and PCGR that include all nucleotide positions, compared with the PCG12, PCG12R, and AA data sets, indicating that third codon positions are greatly more rate-heterogeneous than the first and second codon positions and consistently scored negative in the AliGROOVE pairwise comparisons (supplementary fig. S2, Supplementary Material online). Data sets with the third codon position excluded (PCG12 and PCG12R) therefore reduced the degrees of random sequence similarity and sequence heterogeneity.
F
AliGROOVE analysis for four data sets. The mean similarity score between sequences is represented by a colored square, based on AliGROOVE scores from −1, indicating great difference in rates from the remainder of the data set, that is, heterogeneity (red coloring), to +1, indicating that rates match all other comparisons (blue coloring).
AliGROOVE analysis for four data sets. The mean similarity score between sequences is represented by a colored square, based on AliGROOVE scores from −1, indicating great difference in rates from the remainder of the data set, that is, heterogeneity (red coloring), to +1, indicating that rates match all other comparisons (blue coloring).
Systematic Errors in Standard Bayesian Inference and Maximum-Likelihood Analysis
Standard BI and ML analysis were conducted for each of the five 203-taxa data sets (PCG, PCG12, PCGR, PCG12R, and AA). These analyses using site-homogenous models based on empirical frequencies of amino acid or nucleotide substitutions (including MtArt or GTR-based models) produced contradictory topologies (supplementary figs. S3 and S4, Supplementary Material online). Bayesian analyses of three data sets (BI-PCG, BI-PCG12, and BI-PCG12R) recovered two uncontroversial relationships, Amphiesmenoptera and Antliophora, and only Amphiesmenoptera was supported by other standard BI and ML analyses. ML analyses of three data sets (PCG, PCGR, and AA) with optimized partition schemes could not improve the results recovering one uncontroversial relationship at most (supplementary fig. S5, Supplementary Material online). There is no significant difference in topology or nodal support between different partitioning schemes. In general, the unexpected grouping of two gall midges (Diptera: Cecidomyiidae), Strepsiptera and Hymenoptera (excluding Tenthredinidae), was highly supported by most methods (fig. 4 and supplementary figs. S3 and S4, Supplementary Material online). After mapping the values of K, branch lengths, and A + T content onto the phylogenetic tree, we found that species in this clade showed higher values for these parameters than other holometabolan insects (fig. 4), and the similarity scores showed high divergence to all other species displayed in the AliGROOVE analyses (fig. 3 and supplementary fig. S2, Supplementary Material online). These results suggest that phylogenetic artifacts are responsible for recovery of this spurious clade.
F
Systematic errors in the standard phylogenetic analyses under site-homogeneous model. The tree is obtained by Bayesian analysis of nucleotide sequences of protein-coding genes (BI-PCG) under site-homogeneous models. Orange circles with number indicate recovered uncontroversial relationships in figure 1. The unexpected clade caused by accelerated substitution rates and compositional heterogeneity of holometabolan mitochondrial genomes is highlighted by a dotted line box. Error bars represent standard deviations from data of multiple species.
Systematic errors in the standard phylogenetic analyses under site-homogeneous model. The tree is obtained by Bayesian analysis of nucleotide sequences of protein-coding genes (BI-PCG) under site-homogeneous models. Orange circles with number indicate recovered uncontroversial relationships in figure 1. The unexpected clade caused by accelerated substitution rates and compositional heterogeneity of holometabolan mitochondrial genomes is highlighted by a dotted line box. Error bars represent standard deviations from data of multiple species.
Phylogenetic Results under Site-Heterogeneous Models
Bayesian (PhyloBayes) analyses under the CAT + GTR model from the AA and PCG data sets recovered three uncontroversial relationships, Neuropterida (BPP [Bayesian posterior probabilities] = 0.65 and 0.99), Amphiesmenoptera (BPP = 1.0), and Antliophora (BPP = 0.98 and 0.77) (supplementary fig. S6, Supplementary Material online). Artifactual clades similar to those in the standard Bayesian and ML analyses were also found in these two analyses, for example, the grouping of two gall midges (Diptera: Cecidomyiidae), Strepsiptera, and most Hymenoptera in the AA data set, and the sister relationship between Coleoptera and Strepsiptera plus most Hymenoptera in the PCG data set. Removal of third codon positions (PCG12) or RY-coding (PCG-RY) could break up these artifactual groupings, but the positions of Strepsiptera, Amphiesmenoptera, and Hymenoptera were still unresolved (supplementary fig. S6, Supplementary Material online). These results indicate that when the CAT + GTR model is used, RY-coding and removal of third codon positions can reduce the effect of systematic bias. However, the effect of these strategies was insufficient to recover the correct relationships.Further improvements were obtained when also including the rRNA genes (BI-PCGR), which recovered the basal split of Hymenoptera from all others (BPP = 1.0), and the monophyly of Neuropteroidea (BPP = 0.75), Coleopterida (BPP = 0.78), Neuropterida (BPP = 1.0), and Amphiesmenoptera (BPP = 0.99) (fig. 5A). PhyloBayes analysis of the two rRNA genes alone (BI-RNA) recovered the monophyly of Amphiesmenoptera (BPP = 0.96), Coleopterida (BPP = 0.89; although Strepsiptera was placed within Coleoptera), and Diptera (BPP = 0.95), and the artifactual clades common to the analyses of AA and PCG data sets were not found (supplementary fig. S6, Supplementary Material online). Thus, the inclusion of the rRNA genes reduced the impact of systematic errors to a certain extent. In combination with RY-coding (PCGR-RY), we found the above five relationships (BPP ≥ 0.75) and further recovered the monophyly of Mecopterida (BPP = 0.76) and the sister relationship of Neuropteroidea and Mecopterida (BPP = 0.65) (fig. 5B). Finally, when applying BI-PCG12R, we obtained the best tree topology of deep relationships of Holometabola consistent with accepted nuclear data and morphology-based hypotheses: (Hymenoptera + ((Coleopterida + Neuropterida) + (Amphiesmenoptera + Antliophora))) (fig. 5C and supplementary fig. S7, Supplementary Material online).
F
Holometabolan phylogenies inferred from the combined protein-coding genes and rRNA gens using PhyloBayes with the CAT + GTR model. (A) Bayesian tree from the data set PCGR under the CAT + GTR model. (B) Bayesian tree from the data set PCGR-RY under the CAT + GTR model. (C) Bayesian tree from the data set PCG12R under the CAT + GTR model. We show a schematic version of the Bayesian trees with some lineages collapsed for clarity. Supports at nodes are Bayesian posterior probabilities. Orange circles with number indicate recovered uncontroversial relationships in figure 1.
Holometabolan phylogenies inferred from the combined protein-coding genes and rRNA gens using PhyloBayes with the CAT + GTR model. (A) Bayesian tree from the data set PCGR under the CAT + GTR model. (B) Bayesian tree from the data set PCGR-RY under the CAT + GTR model. (C) Bayesian tree from the data set PCG12R under the CAT + GTR model. We show a schematic version of the Bayesian trees with some lineages collapsed for clarity. Supports at nodes are Bayesian posterior probabilities. Orange circles with number indicate recovered uncontroversial relationships in figure 1.
Improvements Using Site-Heterogeneous Models
Sequence saturation and compositional heterogeneity are two frequent causes of phylogenetic artifacts (Philippe et al. 2011; Rota-Stabelli et al. 2013). Inferences of ancient nodes are hampered by multiple superimposed substitutions (homoplasy) and therefore depend on the accuracy of the model of sequence evolution, to avoid spurious convergences (Lartillot and Philippe 2004; Sperling et al. 2009; Morgan et al. 2013). We measured how well various models estimate sequence saturation and homoplasy. Posterior predictive analysis showed that MtArt and GTR models inferred a much lower level of homoplasy in the amino acid (AA) and nucleotide (PCG and PCGR) data sets compared to the CAT + GTR model (supplementary table S6, Supplementary Material online). Using the CAT + GTR model, predicted homoplasies of the data sets PCG (251.59) and PCGR (220.51) were much higher than after removal of third positions (PCG12, 92.25; PCG12R, 119.76) or RY-coding (PCG-RY, 117.46; PCGR-RY, 111.62) and amino acid coding (AA, 92.32) (supplementary table S6, Supplementary Material online).These results were confirmed based on the level of sequence divergence assessed by plotting patristic distances in standard and CAT + GTR models. The slope of these plots was extremely low for data sets that include all nucleotide positions under the standard GTR model (0.0743 for PCG and 0.0902 for PCGR) (fig. 6A) and amino acids under the MtArt model (0.2323 for AA) (fig. 6B). The plots of uncorrected p-distance against the distances from the CAT-GTR model still showed high saturation after removal of third positions and RY-coding (slope of PCG12 = 0.0187 and PCG-RY = 0.0184), whereby the inclusion of the rRNA genes clearly reduced the level of saturation shown for PCGR-RY (0.0227; fig. 6C) and PCG12R (0.0211; fig. 6D). Thus, removal of third positions or RY-coding greatly reduced saturation against the full PCG data set (0.0092), while AA recoding had an even greater effect. These results indicate the difficulty of estimating the correct rates due to saturation of nucleotide changes under the GTR and MtArt models, which may account for spurious groups exhibiting high A + T content and fast evolutionary rate (e.g., the grouping of Cecidomyiidae, Strepsiptera, and most Hymenoptera), while the site-heterogeneous model are less affected and resolve the tree correctly, in particular once the third positions have been removed.
F
Model-based saturation plots for the amino acid and nucleotide data sets. (A) Plots of the patristic distances of all data (AA, PCG, and PCGR) estimated from the CAT + GTR tree compared with the distances from the “site-homogeneous” MtArt and GTR-based models. Plots of the observed distances (uncorrected P-distances) against distance estimated from the CAT + GTR tree, using (B) all data, (C) all data after RY coding, and (D) first and second positions only.
Model-based saturation plots for the amino acid and nucleotide data sets. (A) Plots of the patristic distances of all data (AA, PCG, and PCGR) estimated from the CAT + GTR tree compared with the distances from the “site-homogeneous” MtArt and GTR-based models. Plots of the observed distances (uncorrected P-distances) against distance estimated from the CAT + GTR tree, using (B) all data, (C) all data after RY coding, and (D) first and second positions only.
Phylogenetic Signal in the Nucleotide and Amino Acid Data Sets of Mitochondrial Genomes
As homoplasy and degree of saturation varied greatly among classes of nucleotides and had great effects on the phylogenetic trees even under the CAT + GTR model, we attempted to dissect the role of variable rates among nucleotides by gradually excluding an ever-larger proportion of the fastest evolving sites using SlowFaster (Kostka et al. 2008). Based on the best fitting CAT + GTR model and sub-data sets of PCGR, this analysis showed that signal supporting the basal split of Hymenoptera, and the monophyly of Coleopterida, Neuropterida, and Amphiesmenoptera was stable under any level of data removal (fig. 7A). However, the monophyly of Mecopterida and the sister relationship of Neuropteroidea and Mecopterida were recovered only over a small window of data exclusion, after approximately 19% of the fastest evolving sites were excluded, but lost again after exclusion of approximately 32% or more of sites (fig. 7B). Signal for Neuropteroidea was also lost after exclusion of approximately 53% of fastest evolving sites. Seven of eight selected uncontroversial relationships were recovered with high support (BPP > 0.85) and only the monophyletic Antliophora was not recovered by SF analyses of the PCGR data set. The best SF analyses of AA and PCG recovered few selected uncontroversial relationships, with three relationships recovered in the complete AA data and four relationships supported by the PCG after excluding approximately 47% of fastest evolving sites (supplementary fig. S8, Supplementary Material online). Overall, the SF analyses obtained better topologies with the PCGR rather than the PCG and AA data sets, as inclusion of the rRNA genes apparently increased the amount of phylogenetic signal and the removal of fastest evolving sites further decreased the proportion of spurious variation. However, phylogenetic signal supporting different deep relationships was derived from sites with various evolutionary rates, and therefore the removal of the fastest evolving sites also caused the loss of phylogenetic signal and the increased effect of competing signal for some nodes, which resulted in incorrect topology or low support levels (e.g., the low support for nodes Neuropteroidea + Mecopterida, Mecopterida, and Antliophora in the PhyloBayes tree of PCG12R; fig. 5C).
F
Slow-fast analyses of the nucleotide data set of the combined protein-coding genes and rRNA genes. (A) Posterior probabilities using Bayesian CAT + GTR model for various sub-data sets deprived of classes of fast-evolving sites in the data set PCGR (as indicated by the amount of sites left in the data sets). Eight uncontroversial relationships in figure 1 (orange circles) are selected as indicators to test the phylogenetic signals in the data sets. (B) Holometabolan phylogeny inferred from the data set PCGR with approximately 19% fastest evolving sites excluded using PhyloBayes under the CAT + GTR model. We show a schematic version of the Bayesian trees with some lineages collapsed for clarity.
Slow-fast analyses of the nucleotide data set of the combined protein-coding genes and rRNA genes. (A) Posterior probabilities using Bayesian CAT + GTR model for various sub-data sets deprived of classes of fast-evolving sites in the data set PCGR (as indicated by the amount of sites left in the data sets). Eight uncontroversial relationships in figure 1 (orange circles) are selected as indicators to test the phylogenetic signals in the data sets. (B) Holometabolan phylogeny inferred from the data set PCGR with approximately 19% fastest evolving sites excluded using PhyloBayes under the CAT + GTR model. We show a schematic version of the Bayesian trees with some lineages collapsed for clarity.
Applying an A Priori Criterion for the Removal of Most Fast-Evolving Sites
The cut-off for removal of fast sites is a subjective choice unless an a priori criterion can be applied for determining what sites should be targeted. The OV-sorting method reorders the sites in the alignment according to their observed rates, which is assessed by a statistical analysis identifying the most deviating sites in the matrix relative to the rates in the other sites and thus provides an objective stopping point for data removal (Zhong et al. 2011, 2014; Goremykin et al. 2013). This method was applied to the 13,799 sites of the PCGR data set by removing sites in increments of 500 nt. The optimal number of sites retained (GNB criterion) was 11,799 (2,000 sites removed from the full data set), identified by significant improvement in the correlation analysis that compares the variability in the removed fraction (partition B) and the retained data (partition A) in the iterative data reduction procedure (fig. 8). When the OV-sorted data set was subjected to PhyloBayes, seven of the eight selected uncontroversial relationships were recovered with high support (BPP ≥ 0.95) although the monophyletic Antliophora was not recovered (fig. 9 and supplementary fig. S9, Supplementary Material online). The topology of interordinal relationships was similar to the best SF analysis of PCGR data set, but has higher support levels for the recovered nodes. Within insect orders, many well-resolved relationships were also recovered, for example, 1) the paraphyletic Symphyta and the monophyletic Apocrita in Hymenoptera (Mao et al. 2015); 2) a robust intraordinal relationships in Lepidoptera that is in line with the recent phylogenomic study based on 2,696 nuclear genes (Kawahara and Breinholt 2014); 3) the monophyletic groups including Cyclorrhapha, Brachycera, and Neodiptera (placing Bibionomorpha as a sister group to Brachycera) in Diptera (Wiegmann et al. 2011); and 4) the monophyly of Polyphaga and Adephaga, and the sister relationship between Scirtidae and the remaining Polyphaga in Coleoptera (Timmermans et al. 2015). This indicates that phylogenetic error can be reduced by OV sorting, and the GNB criterion is suitable for identifying erroneous signal from fast-evolving sites and for selecting the appropriate data for mitochondrial phylogenomic studies. Although this criterion is applied in insect mitochondrial genomic data for the first time, we suggest that more studies in different insect groups are necessary to test the reliability of this method.
F
Results of OV analysis. (A) Plot showing results of Pearson correlation analyses. The green dotted line indicates the Pearson correlation coefficients (r) of ML distances for A partitions (the more conserved) and B partitions (less conserved). The orange dotted line represents r value of uncorrected p-distances and ML distances for B partitions. The r values begin to increase sharply at the forth OV-shortening step of the PCGR data set (11,799 position remained). (B) Plot showing mean deviations between ML and p distances for B partitions. In calculating ML distances, the best-fitting ML model for each partition was first determined under the AIC using ModelTest (Posada and Crandall 1998). The orange dotted line indicates results from analyses using a neighbor-joining tree to fit ML model parameters. The green dotted line indicates results obtained when an ML tree is used to fit substitution model parameters.
F
Holometabolan phylogenies inferred from the OV-sorted PCGR data set using PhyloBayes with the CAT + GTR model. The OV-sorted PCGR data set (11,799 bp) was selected by the GNB criterion (fig. 8). We show a schematic version of the Bayesian trees with some lineages collapsed for clarity and the full tree with branch lengths can be inspected in supplementary figure S9, Supplementary Material online. Bracket with number indicates the number of sampled species in a family. Supports at nodes are Bayesian posterior probabilities. Orange circles with number indicate recovered uncontroversial relationships in figure 1.
Results of OV analysis. (A) Plot showing results of Pearson correlation analyses. The green dotted line indicates the Pearson correlation coefficients (r) of ML distances for A partitions (the more conserved) and B partitions (less conserved). The orange dotted line represents r value of uncorrected p-distances and ML distances for B partitions. The r values begin to increase sharply at the forth OV-shortening step of the PCGR data set (11,799 position remained). (B) Plot showing mean deviations between ML and p distances for B partitions. In calculating ML distances, the best-fitting ML model for each partition was first determined under the AIC using ModelTest (Posada and Crandall 1998). The orange dotted line indicates results from analyses using a neighbor-joining tree to fit ML model parameters. The green dotted line indicates results obtained when an ML tree is used to fit substitution model parameters.Holometabolan phylogenies inferred from the OV-sorted PCGR data set using PhyloBayes with the CAT + GTR model. The OV-sorted PCGR data set (11,799 bp) was selected by the GNB criterion (fig. 8). We show a schematic version of the Bayesian trees with some lineages collapsed for clarity and the full tree with branch lengths can be inspected in supplementary figure S9, Supplementary Material online. Bracket with number indicates the number of sampled species in a family. Supports at nodes are Bayesian posterior probabilities. Orange circles with number indicate recovered uncontroversial relationships in figure 1.
Methodological Implications for Mitochondrial Phylogenomics of Deep Insect Relationships
It is well known that insect mitochondrial genomes display strong base compositional and mutational rate heterogeneity, variation among different genes, among codon positions within a gene, and among different taxonomic levels (Sheffield et al. 2009; Castellana et al. 2011; Bernt et al. 2013; Li et al. 2013; Cameron 2014), as shown for the holometabolan insects in this study. Such heterogeneities violate the stationarity assumption of the widely used site-homogeneous models of nucleotide substitution (Rosenberg and Kumar 2003; Kolaczkowski and Thornton 2004; Lartillot and Philippe 2004; Hassanin 2006). Site-homogeneous models assume the same evolutionary process for every site of the data set (only the evolutionary rate can be modeled as heterogeneous across sites, usually through a gamma distribution of rates), although evolutionary processes are known to be heterogeneous across positions (Philippe et al. 2011). The CAT + GTR in PhyloBayes is better suited to accommodate this variation in the evolutionary process across sites (Lartillot and Philippe 2004; Philippe et al. 2011). This model establishes k profiles of equilibrium frequencies combined with general exchange rates (using a Dirichlet process to describe the likelihood of the distribution of discrete categories) (Lartillot and Philippe 2004; Lartillot et al. 2009). In this study, the better fit of the site-heterogeneous mixture model to holometabolan mitochondrial phylogenomic data sets was confirmed by analyses of cross-validation, model-based saturation plots, and posterior predictive simulation. The method also seems well suited to deal with different types of data composed of protein-coding and rRNA genes. Other studies have also demonstrated that these models provide a better fit to phylogenomic data and tend to reduce tree reconstruction artifacts (Sperling et al. 2009; Rota-Stabelli et al. 2010; Husník et al. 2011; Morgan et al. 2013; Li et al. 2015; Timmermans et al. 2015). We show the power of this approach for resolving the tree of Holometabola, indicating that model adequacy is critical for accurate tree reconstruction in mitochondrial phylogenomic studies.Yet, the site-heterogeneous model is still affected by the biases in the data driven by saturation of variation and high levels of homoplasy. The rapidly evolving sites in mitochondrial genomes and in particular the third codon position of protein-coding genes are expected to be the most heterogeneous in composition and saturated in substitutions, and often contribute to various phylogenetic artifacts (Pisani 2004; Rota-Stabelli et al. 2010). Removal of these sites or the use of amino acid data is considered an effective method dealing with systematic errors (Kostka et al. 2008; Zhong et al. 2011; Cameron 2014; Liu et al. 2014). Our analyses confirmed that these strategies could significantly reduce the sequence saturation of different data sets (e.g., in AA, PCG, and PCGR). However, the genuine phylogenetic signal for ancient phylogenetic relationships is always weak and differs between protein-coding and rRNA genes. The inclusion of two rRNA sequences is helpful in increasing the amounts of phylogenetic signal and the signal-to-noise ratio. Furthermore, the site-heterogeneous models demand more data to correctly estimate the distribution of site-specific effects and to discriminate among competing phylogenetic hypotheses (Quang et al. 2008; Morgan et al. 2013). Thus adding the two rRNA genes to the protein-coding genes improved the topology of the tree by adding more phylogenetic information.Based on the analyses with the site-heterogeneous model, we show that a part of fast-evolving sites should be discarded from phylogenetic analyses to resolve the deepest nodes in holometabolan phylogeny. Exclusion of these sites significantly reduced the phylogenetic errors and actually generated the best topological resolution. However, nodal support of some clades was also reduced. These results indicate that these fast-evolving sites include not only the majority of misleading signal, but also contribute valuable phylogenetic information. Accordingly, our results suggest that mitochondrial phylogenomic studies of basal relationships of insects require rigorous analyses with suitable evolutionary models and careful evaluation of which data to include. Therefore, an accurate method for detecting and removing the part of data containing a high level of nonphylogenetic signal is important for mitochondrial phylogenomic studies. Our exploratory analysis indicates that the GNB criterion of the OV-sorting method (Goremykin et al. 2010, 2013) is suitable for identifying sites most affected by multiple substitutions and it could be useful for other insect groups and data sets with similar properties.Phylogenomics, the inference of phylogenetic relationships using genome scale data (from EST, transcriptome, and whole-genome sequences), have shown their power for assembling the tree-of-life (Philippe et al. 2009; Jarvis et al. 2014; Misof et al. 2014). However, systematic error resulting from nonphylogenetic signal is not expected to disappear with the addition of large amounts of data (Delsuc et al. 2005; Jeffroy et al. 2006; Philippe et al. 2011). In addition, the requirement for living materials is still a major limitation to transcriptomic analyses. With next-generation sequencing technology, hundreds or even thousands of mitochondrial genomes can be efficiently and economically obtained from a pooled mixture of DNA extracts (Gillett et al. 2014; Tang et al. 2014). As little as 1 ng of genomic DNA from each species is sufficient for pooling and many degraded voucher specimens of rare species in museum collections can also be suitable for sequencing (Timmermans et al. 2016). The ease of sequencing, the feasibility of large taxon sampling, and other advantages of mitochondrial genomes (easy alignment, straightforward gene orthology, etc.) remain major reasons for its continued use in the phylogenomic era.In order to effectively use mitochondrial genome data to correctly resolve difficult phylogenetic questions, particularly the ancient divergences, inferences from mitochondrial genomic data should always assess the possible impact of substitutional saturation and compositional biases whose effect is not independent but potentially correlated. We therefore suggest that phylogenetic hypotheses inferred from mitochondrial genomic data be interpreted with caution, even when highly supported, until the effects of systematic errors are fully assessed through data deletion strategies (e.g., OV-sorting method), and by using site-heterogeneous mixture models.
Supplementary Material
Supplementary tables S1–S6 and figures S1–S9 are available at Genome Biology and Evolution online (http://www.gbe.oxfordjournals.org/).
Authors: Henner Brinkmann; Mark van der Giezen; Yan Zhou; Gaëtan Poncelin de Raucourt; Hervé Philippe Journal: Syst Biol Date: 2005-10 Impact factor: 15.683
Authors: Casey W Dunn; Andreas Hejnol; David Q Matus; Kevin Pang; William E Browne; Stephen A Smith; Elaine Seaver; Greg W Rouse; Matthias Obst; Gregory D Edgecombe; Martin V Sørensen; Steven H D Haddock; Andreas Schmidt-Rhaesa; Akiko Okusu; Reinhardt Møbjerg Kristensen; Ward C Wheeler; Mark Q Martindale; Gonzalo Giribet Journal: Nature Date: 2008-03-05 Impact factor: 49.962
Authors: K S Pick; H Philippe; F Schreiber; D Erpenbeck; D J Jackson; P Wrede; M Wiens; A Alié; B Morgenstern; M Manuel; G Wörheide Journal: Mol Biol Evol Date: 2010-04-08 Impact factor: 16.240
Authors: M J T N Timmermans; S Dodsworth; C L Culverwell; L Bocak; D Ahrens; D T J Littlewood; J Pons; A P Vogler Journal: Nucleic Acids Res Date: 2010-09-28 Impact factor: 16.971
Authors: Martin Kaltenpoth; Patrice Showers Corneli; Diane M Dunn; Robert B Weiss; Erhard Strohm; Jon Seger Journal: PLoS One Date: 2012-03-06 Impact factor: 3.240
Authors: Hu Li; John M Leavengood; Eric G Chapman; Daniel Burkhardt; Fan Song; Pei Jiang; Jinpeng Liu; Xuguo Zhou; Wanzhi Cai Journal: Proc Biol Sci Date: 2017-09-13 Impact factor: 5.349