Universal artifacts affect the branching of phylogenetic trees, not universal scaling laws.

Abstract

BACKGROUND: The superficial resemblance of phylogenetic trees to other branching structures allows searching for macroevolutionary patterns. However, such trees are just statistical inferences of particular historical events. Recent meta-analyses report finding regularities in the branching pattern of phylogenetic trees. But is this supported by evidence, or are such regularities just methodological artifacts? If so, is there any signal in a phylogeny?
METHODOLOGY: In order to evaluate the impact of polytomies and imbalance on tree shape, the distribution of all binary and polytomic trees of up to 7 taxa was assessed in tree-shape space. The relationship between the proportion of outgroups and the amount of imbalance introduced with them was assessed applying four different tree-building methods to 100 combinations from a set of 10 ingroup and 9 outgroup species, and performing covariance analyses. The relevance of this analysis was explored taking 61 published phylogenies, based on nucleic acid sequences and involving various taxa, taxonomic levels, and tree-building methods.
PRINCIPAL FINDINGS: All methods of phylogenetic inference are quite sensitive to the artifacts introduced by outgroups. However, published phylogenies appear to be subject to a rather effective, albeit rather intuitive control against such artifacts. The data and methods used to build phylogenetic trees are varied, so any meta-analysis is subject to pitfalls due to their uneven intrinsic merits, which translate into artifacts in tree shape. The binary branching pattern is an imposition of methods, and seldom reflects true relationships in intraspecific analyses, yielding artifactual polytomies in short trees. Above the species level, the departure of real trees from simplistic random models is caused at least by two natural factors (uneven speciation and extinction rates) and by artifacts such as the choice of taxa included in the analysis, and the imbalance introduced by outgroups and basal paraphyletic taxa. This artifactual imbalance accounts for tree shape convergence of large trees.
SIGNIFICANCE: There is no evidence for any universal scaling in the tree of life. Instead, there is a need for improved methods of tree analysis that can be used to discriminate the noise due to outgroups from the phylogenetic signal within the taxon of interest, and to evaluate realistic models of evolution, correcting the retrospective perspective and explicitly recognizing extinction as a driving force. Artifacts are pervasive, and can only be overcome through understanding the structure and biological meaning of phylogenetic trees. Catalan Abstract in Translation S1.

Year: 2009 PMID: 19242549 PMCID: PMC2644784 DOI: 10.1371/journal.pone.0004611

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

The quest for the Holy Grail inspired great deeds of all sorts, with little use in the end. A current parallel is the search for the Tree of Life, which written in capitals appears to have a Biblical dimension. Indeed, its mythology includes the notion that in such a tree one could reach an understanding of life's diversification on the planet. It is thus not too surprising that its search has fired long, acrimonious polemics on the “right” path to truth, eventually looking more like religious wars in the quest of an unattainable dream than scientific arguments in search of the best approximation to reality. Nowadays the field of phylogenetics is healthily moving away from such confrontations, focusing instead on far more fertile avenues of research. However, the temptation of finding in phylogenetic trees the essence of life remains in the eyes of converts.

The superficial resemblance of phylogenetic trees to real branching structures, such as real trees [1] and rivers [2], is at the origin of the quest for general patterns in the shape of phylogenies. Such a possibility is indeed intriguing: if this idea had any validity, it should be possible to look for a grand unifying theme in the history of life. Along this line of thought, three recent meta-analyses report having found regularities in the shape of phylogenetic trees [3]–[5], leading to claims that random models of evolution may explain life's diversification [4], as suggested by early studies [6], or even that there is a “universal scaling in the branching of the Tree of Life”, which would imply that “similar evolutionary forces drive diversification across the broad range of scales” [5]. If this were true, it would indeed be a remarkable finding.

Evolution surely involves linear relationships from parents to offspring, and thus from ancestor to descendant. In order to depict such relationships in print, two-dimensional diagrams are customarily employed, called phylogenetic trees.
Thus it seems logical to analyze these trees in order to address macroevolutionary questions [3], [7]–[18]. It is also possible to search for correlates of hierarchical dendritic structures and their properties, such as the relationship between fractal river basins and neutral models of the fish communities inhabiting them [19]. Thus, the geometry of phylogenetic trees does indeed deserve a detailed study [14], [20].

However, phylogenetic trees are not real structures. They are almost certainly flawed reconstructions of historical events [14]. Moreover, these trees are just statistical inferences [21]. And most critically, they are calculated not in search of universal laws and regularities, but with the goal of reconstructing particular historical events [22]. It is therefore essential to understand that not all phylogenetic trees have the same value, because they are complex hypotheses. The information content of such a tree critically depends on at least three points: 1) the quality and quantity of information upon which it is based; 2) the validity of the method used to infer historical relationships; and 3) the fit of the inferred tree to the data. Thus, the worth of a particular phylogenetic tree may range from trivial to substantial, and its accuracy from mere guess to robust hypothesis. A straightforward conclusion is that any meta-analysis of phylogenetic trees performed with no control over their intrinsic merit is subject to severe pitfalls.

In this context, the reported finding of a universal regularity in phylogenetic trees stems from a radical confusion of reality and diagram. Herewith I refute such claims, on the grounds that they rest solely on artifact. The idea of universal scaling in phylogenies is completely unwarranted, being instead a consequence of bias in principles and methods.
Further developments in the analysis of phylogenetic tree shape should avoid the artifact pitfalls, correcting distortions and reading the paramount signature of biological processes.

Results

The distribution of all possible trees in the space defined by A and C is not random (Fig. 1). All possible trees occur between the bounds imposed by the least and most structured possibilities: fully unresolved and pectinate, respectively. This is an intuitive result, but is relevant because only a small sector of the graph is actually occupied by trees (the remaining regions of the space represent network graphs that are not trees). Trees including polytomies (non-binary, or unresolved) occur throughout this sector. In contrast, all binary trees are bound by a lower limit representing symmetrical trees; that is, all fully resolved trees lie between two limits: an upper, most structured limit, and a lower one representing average random trees. Thus, any tree located below the symmetrical tree expectation must include at least one polytomy.
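The tree-shape space described above can be explored with a short Python sketch that enumerates every rooted, unlabeled tree shape for a given number of terminal taxa and computes A and C at the root. The definitions used are an assumption, since this section does not spell them out: A of a node is taken to count the node itself plus all its descendants, and C is taken as the sum of A over the node and all its descendants. The nested-tuple encoding and all function names are purely illustrative.

```python
from functools import lru_cache
from itertools import product

LEAF = ()  # a terminal taxon; internal nodes are sorted tuples of subtrees


def partitions(n, largest=None):
    """Yield the integer partitions of n as non-increasing tuples."""
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, largest), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest


@lru_cache(maxsize=None)
def shapes(n):
    """All rooted, unlabeled shapes with n leaves (internal nodes have >= 2 children)."""
    if n == 1:
        return frozenset([LEAF])
    found = set()
    for parts in partitions(n):
        if len(parts) < 2:
            continue  # a single part would not split the node
        for children in product(*(shapes(p) for p in parts)):
            found.add(tuple(sorted(children)))  # sorting gives a canonical form
    return frozenset(found)


def a_and_c(tree):
    """Return (A, C) at the root of `tree` under the assumed definitions."""
    a, c = 1, 0
    for child in tree:
        child_a, child_c = a_and_c(child)
        a += child_a
        c += child_c
    return a, c + a


def is_binary(tree):
    """True if every internal node has exactly two children."""
    return tree == LEAF or (len(tree) == 2 and all(is_binary(ch) for ch in tree))


if __name__ == "__main__":
    for n in range(2, 8):
        all_shapes = shapes(n)
        n_binary = sum(is_binary(t) for t in all_shapes)
        print(n, "taxa:", len(all_shapes), "shapes,", n_binary, "binary")
```

Under these assumed definitions, the three four-taxon reference trees give (A, C) = (5, 9) for the fully unresolved tree, (7, 17) for the symmetrical tree and (7, 19) for the pectinate tree, which is the same bottom, middle, top ordering as the reference lines in Fig. 1.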

Figure 1

Distribution of rooted, unlabeled trees in tree-shape space, defined by branch size (A) and cumulative branch size (C).

All trees of up to 7 terminal taxa are shown. Solid symbols indicate binary trees, empty symbols stand for non-binary trees. Ellipses encompass all trees with the same number of terminal taxa (n). The lines are the interpolated expectation for three kinds of trees (the 4-taxa examples shown at right): totally symmetrical, random average (middle); pectinate, most imbalanced (top); and totally unresolved, trivial (bottom). The space actually occupied by all trees is limited by the upper and lower bounds. All binary (fully resolved) trees occur at or above the limit imposed by symmetrical trees. Only trees including at least one polytomy (non-binary, or unresolved) occur below this limit.

The relationship between branch size (A) and cumulative branch size (C) for two analyzed phylogenetic trees (Fig. 2) is shown in Figure 3. At small branch sizes (A<10¹), the data can hardly be distinguished from this expectation, largely due to the narrow band available for small trees. Within a large intermediate section (roughly, 10¹

Figure 2

Two analyzed phylogenetic trees, redrawn unlabeled and with uniform internodal distances.

A) Fig. 7 from [24]; B) Fig. 1 from [25]. Ingroup taxa are Arachnida and Pectinidae, respectively. Outgroup taxa are marked by thick vertical lines. Basal non-monophyletic taxa are highlighted.

Figure 3

Relationship between branch size (A) and cumulative branch size (C) throughout two phylogenetic trees (shown in Fig. 2).

Each data point represents a node. Notice the logarithmic scale on both axes. Open circles show data for tree A, solid dots stand for tree B. The diagonal line is the interpolated expectation from a random average, totally symmetrical tree. Arrows point at below-expectation values belonging to multifurcations. The dotted circle encloses rapidly diverging values belonging to outgroup and basal paraphyletic taxa.

The few data points below the diagonal (indicated by arrows) represent trifurcations in both trees. These non-bifurcating nodes are unresolved nodes, the tree-building algorithm being unable to select one of two or more competing hypotheses about the binary branching pattern for the three lineages involved. These three-stem nodes are not hypotheses of real multifurcation, being instead purely artifactual. Near the basal stem of the real trees (roughly, A≥10²) the values of C conspicuously take off, showing that initial branching is most unbalanced in both trees. These deviating, extreme values represent outgroup and non-monophyletic basal taxa. Outgroups are non-arachnid chelicerates (Pycnogonida and Xiphosura) in tree A, and non-pectinids (Limidae, Propeamussiidae and Spondylidae) in tree B. Basal taxa that turn out to be non-monophyletic are the polyphyletic Acari in tree A (highlighted in pink), and the paraphyletic Limidae (blue), Propeamussiidae (green) and Aequipectinini (purple) in tree B. Given that the outgroups were chosen from distantly related taxa, and that poorly defined basal taxa are a heritage of pre-cladistic taxonomy, the deviating values near the root of both phylogenetic trees are just a consequence of method, and are thus purely artifactual. The resolution provided by the combined use of A and C is not optimal. The value of C is sensitive to the level at which imbalance and polytomies occur. Also, different trees often have the same pair of values.
Moreover, both analyzed real trees yield similar scatter plots, in spite of being quite different. The 100 combinations of ingroup and outgroup taxa analyzed with four different tree-building methods yielded a non-random relationship between outgroups and the imbalance introduced by these (Fig. 4). The regression of tree imbalance (as measured by log ingroup imbalance) on the proportion of outgroups is highly significant for all four methods, as well as for the whole set of trees (Table 1). However, the regression coefficient ranges from low to moderate, given the wide dispersion of data points. Likewise, the regression slope also varies widely among subsets. The lowest values of r and slope are provided by the Bayesian trees, a reflection of their sensitivity to outgroup selection and their tendency to have high node support. At the opposite end, maximum parsimony yields the highest scores for r and slope, showing the comparative robustness of this method against variations in the outgroups chosen –parsimony uses outgroups basically to determine character-state polarity. Maximum likelihood and distance methods stand at mid range, probably due to the more algorithm-dependent ways in which they work. Taking all 400 trees together also yields intermediate values, as a result of averaging over the four methods. Pairwise covariance analyses among the four methods show that maximum likelihood and distance regressions are not significantly different, while maximum parsimony and Bayesian are distinct (Table 2).

Figure 4

Values of log outgroup imbalance plotted against the relative proportion of outgroups in the dataset of trees obtained applying four tree-building methods to 100 combinations of a set of outgroup and ingroup taxa.

Table 1

Regression analyses for different tree-building methods applied to 100 combinations of a set of outgroup and ingroup taxa.

Data set	equation	r²	p-value
ALL	y = −0.0417x+0.0693	0.356	<0.001
BA	y = −0.0185x+0.0086	0.143	<0.001
ML	y = −0.0413x+0.0854	0.345	<0.001
MP	y = −0.0607x+0.0961	0.672	<0.001
NJ	y = −0.0463x+0.0870	0.417	<0.001

Data sets are all trees (ALL), and trees obtained with Bayesian (BA), maximum likelihood (ML), maximum parsimony (MP), and distance (NJ) methods. Variables are the proportion of outgroup taxa (x) and log outgroup imbalance (y). Regression lines are plotted in Fig. 4.
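As a reading aid for Table 1, the fitted lines can be evaluated directly. The Python sketch below copies the slopes and intercepts from the table and reports, for each data set, the predicted log outgroup imbalance at a chosen value of x and the value of x at which the fitted line crosses zero. The function names are illustrative, and the zero-crossing is only a property of the fitted lines, not a result reported in the text.

```python
# (slope, intercept) of y = slope * x + intercept, copied from Table 1.
TABLE_1 = {
    "ALL": (-0.0417, 0.0693),
    "BA": (-0.0185, 0.0086),
    "ML": (-0.0413, 0.0854),
    "MP": (-0.0607, 0.0961),
    "NJ": (-0.0463, 0.0870),
}


def predicted_imbalance(data_set, x):
    """Predicted log outgroup imbalance (y) at outgroup proportion x."""
    slope, intercept = TABLE_1[data_set]
    return slope * x + intercept


def zero_crossing(data_set):
    """Value of x at which the fitted line predicts y = 0."""
    slope, intercept = TABLE_1[data_set]
    return -intercept / slope


def by_steepness():
    """Data sets ordered from steepest to shallowest fitted slope."""
    return sorted(TABLE_1, key=lambda name: abs(TABLE_1[name][0]), reverse=True)


if __name__ == "__main__":
    for name in by_steepness():
        print(name, round(predicted_imbalance(name, 1.0), 4), round(zero_crossing(name), 2))
```

Ordering the data sets by slope reproduces the ranking given in the text: maximum parsimony has the steepest line, the Bayesian trees the shallowest, and maximum likelihood and the distance method fall in between.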

Table 2

Pairwise covariance analyses among the different tree-building methods shown in Table 1.

Comparison	F (x)	p-value (x)	F (method)	p-value (method)	F (x*method)	p-value (x*method)
BA vs ML	12.62	<0.001	40.13	<0.001	9.71	<0.005
BA vs MP	17.39	<0.001	71.66	<0.001	45.47	<0.001
BA vs NJ	13.23	<0.001	48.83	<0.001	15.11	<0.001
ML vs MP	66.55	<0.001	0.81	>0.05	7.26	<0.005
ML vs NJ	53.68	<0.001	0.01	>0.05	0.39	>0.05
MP vs NJ	150.46	<0.001	0.61	>0.05	4.19	<0.05

Linear regressions are shown for each tree-building method, and for the whole set of 400 trees (thick black line). BA = Bayesian, ML = maximum likelihood, MP = maximum parsimony, NJ = BIONJ distance method.

The imbalance attributable to outgroups in published phylogenetic trees shows a wide dispersion (Fig. 5). Linear regressions are not significant for maximum parsimony, maximum likelihood and distance-based trees, due to the extreme dispersion of data points. For Bayesian trees, a moderate relationship exists (y = −0.0634x+0.0437, r = 0.396, P<0.05), but this is probably a spurious result stemming from two artifacts: this method's sensitivity even when few outgroups are included, and the lack in this subset of trees with a high proportion of outgroups. Taking the whole set of published trees, a weak linear regression was found (y = −0.0186x+0.0259, r = 0.077, P<0.05). However, all values of log outgroup imbalance are normally distributed (mean = 0.0148, s.d. = 0.0436, AD = 0.543, P = 0.157), suggesting the existence of a constraining factor that keeps real trees close to a situation of null impact of outgroups on tree balance. Although most analyses have few outgroups and these appear to have a low, mostly positive impact on tree balance, values are mostly negative roughly between outgroup proportions around 1 and 2, and above 2 the few data points are close to zero. This suggests a non-linear relationship. Indeed, a quadratic regression (y = 0.0144x²−0.0554x+0.0366, r = 0.115, P<0.05) appears to be slightly better for all published trees. This curvilinear regression suggests that the constraining factor is particularly intense when outgroups clearly outnumber ingroup taxa.
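The shape of the quadratic fit reported above can be checked with a few lines of Python. The sketch below takes the published coefficients at face value and computes where the fitted curve crosses zero and where it reaches its minimum. It describes the fitted curve only, not the underlying data points, and the function names are illustrative.

```python
import math

# Coefficients (a, b, c) of y = a*x**2 + b*x + c, copied from the text.
QUADRATIC = (0.0144, -0.0554, 0.0366)


def quadratic_roots(a, b, c):
    """Real roots of a*x**2 + b*x + c = 0, smallest first."""
    disc = b * b - 4 * a * c
    if disc < 0:
        return ()
    root = math.sqrt(disc)
    return tuple(sorted(((-b - root) / (2 * a), (-b + root) / (2 * a))))


def vertex(a, b, c):
    """(x, y) of the extremum of the parabola."""
    x = -b / (2 * a)
    return x, a * x * x + b * x + c


if __name__ == "__main__":
    print("zero crossings:", quadratic_roots(*QUADRATIC))
    print("minimum:", vertex(*QUADRATIC))
```

With these coefficients the fitted curve is negative between x ≈ 0.85 and x = 3.0, with a minimum of about −0.017 near x ≈ 1.92. This is broadly in line with the description of mostly negative values at intermediate outgroup proportions and values returning toward zero at higher ones.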

Figure 5

Values of log outgroup imbalance plotted against the relative proportion of outgroups in the dataset of 61 published phylogenetic trees.

Data points labeled as in Fig. 4.

Discussion

The distressing point from the comparison of different methods of phylogenetic inference is that all of them are quite sensitive to the artifacts introduced by outgroups. The differences among trees obtained with different methods are minor, and appear to be largely related just to the idiosyncrasy of algorithms. The good news, however, is that published phylogenetic trees appear to be under a remarkable, unexpected constraint. The constraining factor is most likely the fact that practicing taxonomists appear to be generally (and rather intuitively) aware of these artifacts, so they tend to choose carefully the array of outgroup taxa. A corollary of this is that there is no hope for any brute-force meta-analysis performed without consideration of what phylogenetic trees really mean and how they are obtained. A second consequence of this finding is that there is a wide open field for designing formal ways to discriminate the noise due to outgroups from the phylogenetic signal within the taxon of interest. The methods presented here and the following discussion may provide some guide.

Not all phylogenetic trees are equally valid; in fact, there are huge differences in their robustness or support. This variable extent and reliability of phylogenetic hypotheses translates into artifacts in tree shape. For example, poor quality data introduce noise that results in increased imbalance [26]–[28]. Likewise, tree size does have an impact, because real large trees tend to approach a predictable, moderate level of imbalance [4]. These problems can be circumvented in part because tree shape and fit to the data appear to be unrelated [29], and there is at least one measure of imbalance that is independent of tree size [3]. Without being aware of these problems and how to treat them, one may gather a bewildering array of grossly dissimilar trees. Thus, having no control over what different trees mean surely will reduce any possibility of finding common rules.
The three meta-analyses [3]–[5] were based on TreeBASE (http://www.treebase.org), a searchable, archival repository of data and scientific references [30], which can be explored by statistical packages designed to perform large-scale analyses of tree shape [15]. Only binary trees were included in [3], while polytomies were resolved under a random model in [4]. In order to ensure “testing the universality of the results derived across scales”, thousands of cladograms and a few dozen “intraspecific phylogenies” were compiled in [5]. This sampling was totally uncritical, aimed at amassing a bulk of different trees. Moreover, it was partially manual, although simply taking numerous trees with no selection criterion from the literature or from a repository database should yield virtually identical results. Basically, the problem is that it is unclear whether adding numerous hypotheses with an unknown degree of uncertainty may yield a credible global answer. Resolving phylogenetic trees into perfectly dichotomous branching patterns is a general goal in phylogenetics [31]. However, as any approach that imposes structure on the data, bifurcations are an imposition of method, not necessarily a reality [32]–[35]. All tree-building methods will force a binary tree on the data, but it has seldom been tested at what point of the analysis the conclusions might stretch beyond the assumptions, and thus at what level of detail it would be warranted to stop [21]. One such limitation involves short interior branches (i.e., fast evolutionary radiations), which may be even more prone to error in reality than predicted by theoretical studies [36]. Actually, it may not be really necessary to resolve a multifurcation “bush” (i.e., non-binary splits, or polytomies) in rapidly branching parts of a tree, because the temporal information encoded in that unresolved topology may be more relevant than the detailed sequence of bifurcations [31]. 
Another overstretching of methods occurs because above species level multifurcations that surely exist in evolution will always tend to be split. A justification may be that it is easier to work on a strictly binary set of nodes, although it is already possible to deal with polytomies in trees [11]. Ideally, the assumptions of systematists should be in agreement with those underlying tree-building algorithms [37]. However, even if there is a real dichotomous structure in the data, unresolved nodes will often occur mostly at or near the terminal branches, because the data analyzed are usually gathered with the goal of resolving mostly the intermediate taxonomic levels considered, and thus may not allow discriminating among very similar terminal taxa. Thus, the best resolution is generally in the middle of published trees. One must bear in mind that awfully unresolved trees are seldom published. Also, it is in the central area that the researcher's interest was in the first place. This explains departures from expected values in the left part of Fig. 3. It is also a good reason to prefer analyses in the tree space defined by A and C, given that it includes polytomous trees. The artifactual nature of binary trees is most relevant at or below the species level. Species may be incompletely isolated due to recent or incomplete speciation, the pattern of speciation may not be a simple cladogenetic event but may be instead paraphyletic, hybridization may cause reticulate evolution, and sorting of ancestral polymorphisms may render gene trees incongruent with species trees [17], [38]–[40]. Toward the contemporary tips of a phylogenetic tree, resolution is subject to the delimitation of species, a complex and often arbitrary issue that is not part of the phylogenetic inference process; eventually, recognizing the distinctiveness of individual taxa becomes problematic, because recent and incipient speciation may be difficult to identify [17], [41]. 
Even more problematic is portraying intraspecific variation as a branching tree. Within a species there is gene flow, so gene trees will only rarely translate directly into a history of population subdivision. It would be more meaningful to ask in the first place if there is an inherent hierarchical structure in the data [34]. Actually, the clustering of subpopulations and the comparison of trees for different genes are by no means simple tasks, and dichotomous branching ordinations are just a small part of the methods available [42]. However, provided one is aware of their meaning, they can be powerful tools in combination with other approaches to deal with intraspecific data [13], [43]. It is obvious that trees of intraspecific variation are actually simplified sketches, and thus have a radically different nature than interspecific trees. Thus, the mixing of intraspecific and interspecific trees in [5] has no justification, and their claims of uniform branching pattern above and below the species level are simply an artifact of applying similar binary-tree-building methods to different biological questions. At any rate, the high prevalence of multifurcations that exists among intraspecific trees reflects the inadequacy of tree-building methods for reticulate data, and their finding of lower-than-expected values of C at short branch lengths is solely an artifact.

The selection of trees is also a source of noise. In fact, different tree-building methods produce significantly different arrays of trees [3], [44]. This precaution was not taken into account by [4], [5], who mixed trees obtained from various kinds of tree-building algorithms: some distance-based (neighbor-joining), some based on parsimony, and still others on maximum likelihood. The differences between these methods can be shown to be rather of “degrees of freedom” [21], [45], yet they are based on different assumptions and often yield different outcomes for the same data matrix (as shown in Fig. 4). Moreover, real-world deviations from theoretical simple models of evolution may easily produce artifactual phylogenetic reconstructions under the commonly used models of sequence evolution, and it is still unclear how to capture the historical signal with a minimum of parameters to be estimated from the data [46]–[48]. Also, trees may differ depending on whether they are calculated with a naïve one-step process or derived from an approach that seeks to compare trees and find an average final model [20], [21]; even in simple 3-taxon cases, the outcome may differ strikingly, with substantial evolutionary implications [49]. Thus it remains unclear why trees obtained with different methods from a variety of taxa should be mixed up with no control.

The value of a null model lies not in its mathematical elegance, but in its relevance to the question posed. On average, a totally balanced tree is also expected from Yule's equal-rates Markov model [3], [50], [51], but this kind of tree would be most unusual for any large set of real taxa. In the case of phylogenetic trees, null models based on random, increasing, balanced diversification [5], [6] were only a reasonable early start. More elaborate stochastic models come closer to real trees [3], [4], but it is unclear whether there is any reason to prefer any such model beyond a rough fit to the data and the rejection of the overly simplistic Yule model. Clearly, more realistic models are needed that place randomness right where relevant variables impact the model's behavior [16], [17], [52]–[54]. From this viewpoint, the finding in all three meta-analyses [3]–[5] that the average imbalance of phylogenetic trees inferred from real data falls neatly in between extreme possibilities should come as no surprise (i.e., the symmetric and pectinate trees in [5]; the random and uniform models in [3]; and random and pectinate trees in [4]).
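For readers who want to experiment with the simplest null model mentioned above, the following Python sketch grows tree topologies under the equal-rates Markov (Yule) process by repeatedly splitting a uniformly chosen tip, and summarizes each tree with the Colless imbalance index. The Colless index is not used in this paper; it is swapped in here only as a familiar single-number measure of imbalance, and the list-based tree encoding and function names are illustrative.

```python
import random


def yule_tree(n_tips, rng):
    """Grow a binary topology by splitting a uniformly chosen tip until n_tips exist.

    A tip is an empty list; a bifurcation is a list [left, right].
    """
    root = []
    tips = [root]
    while len(tips) < n_tips:
        tip = tips.pop(rng.randrange(len(tips)))
        left, right = [], []
        tip.extend([left, right])  # the chosen tip becomes an internal node
        tips.extend([left, right])
    return root


def colless(tree):
    """Return (number of tips, Colless imbalance) for a binary tree."""
    if not tree:
        return 1, 0
    (n_left, i_left), (n_right, i_right) = colless(tree[0]), colless(tree[1])
    return n_left + n_right, i_left + i_right + abs(n_left - n_right)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so that runs are repeatable
    n_tips, replicates = 32, 1000
    values = [colless(yule_tree(n_tips, rng))[1] for _ in range(replicates)]
    max_possible = (n_tips - 1) * (n_tips - 2) // 2  # value for a pectinate tree
    print("mean Colless:", sum(values) / replicates, "of a possible", max_possible)
```

Comparing the simulated values with the index of a published tree of the same size is one simple way to see how far that tree departs from this null model, keeping in mind the caveats about outgroups and taxon sampling discussed in the text.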
The departure of real trees from random models can be caused at least by two major natural factors, and two artifacts. The first natural factor is simply that extinction does occur, so not all lineages can continue to divide at the specified rate. As lineages go extinct along a tree, its imbalance will almost inevitably increase. This is a consequence of extinction being the outcome of complex dynamics, so it is not reasonable to expect that it should remain stable across the tree. The second natural deviating factor is that diversification rates will surely vary across the different branches of the tree over time, because it is a complex function of a plethora of intrinsic and environmental factors operating on living organisms. Several methods have been devised to estimate absolute rates of speciation and extinction, showing that large variation in those parameters is the rule [55]–[63]. Indeed, balanced random processes are too slow to account for most patterns of observed diversity, yet diversification is subject to complex environmental constraints [17], [53]. A reflection of such complexity is likely to result in autocorrelation of diversification rate along lineages [8]. Thus, real phylogenies should be expected to range throughout all possible topologies, with no reasonable way of a priori delimiting tree space. Aside from real-world issues, the two major artifacts that increase imbalance are related to the taxa included in the analysis. On one hand, all known taxa from a given group are rarely included, so some choice has to be made. Often this may be imposed by the availability of samples. However, it may be difficult to know whether species have been removed from the analysis deliberately and selectively [26]. And including selected species from high-rank taxa may cause problems of two sorts. Actually, real trees are quite imbalanced, and more so if the taxa are above the species level [39]. 
In addition, such large branches will inevitably result in underestimation of real change, and thus of long branch lengths. This is the pervasive node-density artifact, whose impact on tree shape is still unclear [64]. At any rate, non-random taxon sampling will cause errors in estimates of speciation and extinction rates, more so than just incomplete taxon sampling [65], [66]. Indeed, the inclusion of evolutionarily isolated species may affect synthetic measures of phylogenetic trees [67].

On the other hand, outgroups (used to place the root of the tree) are a definite source of imbalance. At the highest taxonomic levels considered, C has higher-than-expected values, indicating that long branches tend to be more pectinate. But this is due to the inclusion of selected taxa from progressively more distantly related lineages. This is routinely done in order to provide various outgroups. This is justified because, based on sampling theory, the denser the sampling of outgroup taxa, the more stable the internal topology will be and the stronger the test for the monophyly of the ingroup [68], [69]. Since it is clear that outgroup taxa significantly contribute to an excess of imbalance [3], [21], [70], there is a motive for removing outgroups from tree analysis [3], [4]. Unfortunately, the outgroup taxa are often not displayed in the published trees, and it is frequent that more outgroups are included than those explicitly identified as such. Actually, outgroups often involve more than just the first low-diversity branch, or the usual basal one or two single-species branches. In some instances, such as in tree A (Fig. 2), a priori outgroups turn out not to be the branches closest to the root, making any automated identification and deletion of outgroups highly suspect. This problem is exacerbated if the basal taxa turn out to be paraphyletic [39], because they will appear as pectinate long branches. The two trees analysed in detail (Fig. 3) show several basal branches that belong to outgroups that are revealed to be paraphyletic. Actually, higher taxa that have traditionally been considered as basal to other higher-order taxa often turn out to be paraphyletic when subject to cladistic evaluation: the Acari, Limidae, Propeamussiidae and Aequipectinini are likely candidates to join the club of outmoded, unnatural groups such as the Protobranchia, Reptilia, and Pongidae. Without a proper identification of outgroup taxa, coupled to a taxonomic assessment of any basal paraphyletic taxa, it is very hard to control for the pervasive artifact of imbalance increasing at the highest taxonomic levels of published trees. Therefore, the reported findings of imbalance increasing at large tree sizes stem from this control being insufficient in [4] and just missing in [5], and thus appear to be totally caused by the outgroup and basal paraphyly artifacts.

Various tree-shape statistics have been devised, whose merits vary widely. Most of these methods extract a single summary index from the distribution of nodes, so it is not too surprising that the majority of such measures of tree shape are sensitive to the level, or depth in the phylogeny, at which imbalance is concentrated [3], [71] and to the presence of polytomies [36]. As summarized in Fig. 1, C suffers from these same shortcomings. Focusing instead on the dispersion of node traits in a bidimensional plot aims at capturing more of the tree's features [72], although interpretation of such analysis is also difficult [3], [10]. Likewise, estimates of the alpha model fail to adjust extreme tree shapes and often yield a zero value [3], thus being also hard to interpret. As shown above, the relationship between A and C can be used to locate and explain imbalance in the different regions of a given tree, even if there are polytomies.
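The effect of outgroups on the A and C plot can be illustrated with a toy example in Python. The sketch assumes, as this section does not define them explicitly, that A of a node counts the node plus all its descendants and that C is the sum of A over the node and all its descendants. It builds a perfectly symmetrical ingroup, attaches single-taxon outgroups as successive sister branches at the root, and lists (A, C) for every node. The encoding and function names are hypothetical.

```python
LEAF = ()  # a terminal taxon; internal nodes are tuples of subtrees


def balanced(depth):
    """Fully symmetrical binary tree with 2**depth terminal taxa."""
    if depth == 0:
        return LEAF
    subtree = balanced(depth - 1)
    return (subtree, subtree)


def add_outgroups(ingroup, k):
    """Attach k single-taxon outgroups as successive sister branches at the root."""
    tree = ingroup
    for _ in range(k):
        tree = (LEAF, tree)
    return tree


def node_values(tree):
    """List (A, C) for every node in post-order, so the root comes last."""
    values = []

    def visit(node):
        a, c = 1, 0
        for child in node:
            child_a, child_c = visit(child)
            a += child_a
            c += child_c
        c += a
        values.append((a, c))
        return a, c

    visit(tree)
    return values


if __name__ == "__main__":
    print("symmetrical, 8 taxa:", node_values(balanced(3))[-1])
    rooted = add_outgroups(balanced(2), 4)
    print("4 ingroup + 4 outgroups:", node_values(rooted)[-1])
    # Nodes on the outgroup ladder are the last four entries of the list.
    print("outgroup ladder:", node_values(rooted)[-4:])
```

Under the assumed definitions, a fully symmetrical tree of eight taxa has (A, C) = (15, 49) at the root, whereas a symmetrical four-taxon ingroup rooted with four successive outgroups has (15, 69), close to the value of 71 for a fully pectinate tree of the same size. The excess in C is concentrated in the nodes nearest the root, which is the pattern attributed in the text to outgroups and basal paraphyletic taxa.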
The drawbacks of this method are that it lacks optimal resolution, because different trees yield identical values, and that all trees are constrained within a small sector of geometric space, so even quite distinct trees will yield similar plots. Indeed, the two phylogenies in Fig. 2 clearly have quite different shapes, yet translate into broadly similar plots in Fig. 3. It is also relevant that these two parameters can be used to design meaningful measures (such as log outgroup imbalance) of the impact of outgroups (and possibly other artifacts) in tree space. Thus, the uniform relationship between branch size A and cumulative branch size C is due to a narrow design of methods, not a quality of results.

A third avenue is to compare trees strictly in terms of what they are: high-dimensional parameters amenable to geometrical depictions in ultrametric space [73]. Ultrametrics have in fact been successfully applied to a variety of questions where data have a hierarchical structure [34], [74], [75]. This perspective allows the exploration of geometric space [14], [20], [76] without relying on simulations, and leads to the application of statistical methods [21]. It is thus possible to develop a measure of resolution for different tree-shape statistics, and so select those statistics that have similar values only for similar trees [14]. The analysis shown in Fig. 1 is a step in this direction, pointing at further developments in generalized tree shape distribution.

However, there is a critical caveat to any analysis of the shape of phylogenetic trees. Our perspective being inevitably from the present, extant diversity always appears to come out of a burst from a distant single stem [17]. Virtually all real trees will have a rather “conical” shape, because the recent splits considered far outnumber the old surviving lineages.
Including extinct taxa should help in correcting this retrospective illusion, but the incompleteness of the fossil record will always play against such correction. This leads to a second obstacle, related but more difficult to tackle: what exactly are fossil taxa that are basal to a later diversification? In an orthodox cladistic framework, such an extinct species will always be treated as the sister group of all later branches, provided the traits of later taxa can be unequivocally identified in their earliest stages. However, this methodological shortcut may not always provide an accurate description of reality, because our placing of those early stems, or “species germinalis”, depends strongly on later evolution that is only apparent from our contemporary point of view [77]. Clearly, there is a challenge to develop methods for correcting our “convex from the present” view of phylogenetic trees prior to analysis of their actual shape and information content.

In spite of grand declarations, the Darwinian goal of classifying organisms in terms of their relationships of common descent has powered evolutionary research and is at the root of the field of phylogenetics. There is really nothing like universal scaling in phylogenetic trees, and no good reason why it should exist. We are dealing with attempts to understand history [22], so a phylogenetic tree is only a diagram of a complex irreversible process. In this sense, the linking of TreeBASE to databases providing information on the taxa actually included in each analysis [78] is a valuable addition that should help in assessing the significance and merits of each tree before including it in any meta-analysis. Beyond failures based on unreasonable assumptions and oversimplistic paradigms, the wealth of information encoded in phylogenetic trees is there to be deciphered.
However, this will not happen with any uncontrolled meta-analysis, but only through an integration of population genetics, ecology, paleontology, and graph theory. Artifacts pave the way, and they can only be overcome with an understanding of the structure and biological meaning of phylogenetic trees.

Exploring the geometry of unlabeled trees with constant internodal distances represents only an initial approach. It is critical to notice that taking tree topologies alone, explicitly disregarding any time scale, has the implicit problem of overlooking extinction. Time on a phylogeny does matter, at least because individual branch lengths are estimates of different processes depending on where they are located within the tree. Towards the terminal taxa, individual branch lengths estimate the inverse of the speciation rate, but in the basal regions they rather estimate the inverse of the diversification rate, which is the difference between the speciation and extinction rates [79]–[81]. It may even be possible to distinguish decreasing speciation from increasing extinction in early evolutionary radiations [63]. This is relevant to methods such as the lineage-through-time approach [82], [83], which ignores extinct lineages and is thus sensitive to the effects of poor sampling of taxonomic diversity, besides being intrinsically unable to distinguish reduced extinction from enhanced speciation [17]. Although the variability of branch lengths in real trees can be used to test hypotheses about evolutionary rates [65], [84], precise estimation of these rates requires large phylogenetic trees [85], and it is still unclear how to assess in general the impact of disappearing lineages on the shape of phylogenetic trees. Although it is epistemologically impossible to read directly the empty space left by vanished taxa, the contribution of missing branches to the observed patterns remains as a signature to be deciphered.
Eventually, it is the biological phenomenon of extinction that imposes an ultrametric structure on phylogenetic trees, because the unavoidable disappearance of interfertile individuals and intermediate taxa throughout life's history sets apart the surviving lineages and promotes the growth of biodiversity.

Materials and Methods

All rooted, unlabeled trees consisting of up to 7 terminal branches (unnamed taxa) were enumerated, separating binary (fully resolved) trees from those having at least one polytomy (i.e., at least one unresolved node). Among the variety of indices devised to summarize tree shape, the values of branch size (the number of subtaxa from a given node, A) and cumulative branch size (the sum of the sizes of all branches from a given node, C), the two variables measured in [5], were manually calculated for each tree. In order to explore the distribution of all trees in an A vs. C plot (Fig. 1), these values were also calculated for three series of trees: perfectly symmetrical trees, which are expected on average from a purely random branching process; pectinate trees, which are the most imbalanced; and totally unresolved trees, the trivial baseline with one single node. Each series was drawn as a line; this is a continuous interpolation that allows drawing a simple limit in this tree space [23].

Two data-rich phylogenetic trees were selected from recent literature: Fig. 7 in [24], and Fig. 1 in [25]. They belong to different phyla (Arthropoda and Mollusca) and different environments (terrestrial and marine, respectively). Both include only distinct species (i.e., there are only undisputed individual terminal branches), are relatively large (≥60 terminal taxa), include several outgroups and non-monophyletic basal taxa, are the product of excellent scholarship on DNA sequences, and are considered by their authors as working hypotheses likely to change with the inclusion of further evidence. They are shown in Figure 2, redrawn in order to depict only their topology. Tree A is more balanced near the terminal taxa, while tree B is more balanced near the root. The values of A and C were calculated for all subtrees in both trees. A log-log plot of A vs. C was drawn in order to show deviations from the symmetrical tree expectation (Fig. 3).
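
As a minimal sketch of how A and C can be obtained (the values in this study were calculated manually), the following Python function walks a tree written as nested tuples. The nested-tuple encoding, the names `branch_sizes`, `pectinate` and `star`, and the convention that a node counts itself and that a terminal branch has A = C = 1 are assumptions made for illustration. That convention reproduces A = 5 and C = 11 for a fully resolved three-taxon tree, the values listed for three-taxon ingroups in Table 3.

```python
def branch_sizes(tree):
    """Return (A, C) for the node at the root of `tree`.

    A tree is a nested tuple; anything that is not a tuple is a terminal taxon.
    Assumed convention: A counts the node itself plus all its descendants,
    and C is the sum of A over the node and all its descendants.
    """
    if not isinstance(tree, tuple):
        return 1, 1              # a terminal branch: A = C = 1
    a, c = 1, 0                  # start A at 1 to count the node itself
    for child in tree:
        child_a, child_c = branch_sizes(child)
        a += child_a
        c += child_c
    return a, c + a              # add this node's own A to the running sum


def pectinate(n):
    """Fully imbalanced (comb-like) binary tree with n >= 2 terminal taxa."""
    tree = ("t", "t")
    for _ in range(n - 2):
        tree = (tree, "t")
    return tree


def star(n):
    """Totally unresolved tree: one single node bearing n terminal taxa."""
    return tuple("t" for _ in range(n))


if __name__ == "__main__":
    symmetrical4 = (("t", "t"), ("t", "t"))
    for name, tree in [("pectinate", pectinate(4)),
                       ("symmetrical", symmetrical4),
                       ("unresolved", star(4))]:
        print(name, branch_sizes(tree))
```

Under these assumptions, the two six-taxon trees (((t,t),(t,t)),(t,t)) and (((t,t),t),((t,t),t)) both give A = 11 and C = 33, which illustrates how different shapes can share a single pair of values.
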
In order to explore the variation in the relationship between A and C in relation to the proportion of outgroups included in phylogenetic analyses, different tree-building methods were applied to various combinations of a given set of ingroup and outgroup taxa. The amino acid sequences included in this analysis belong to the AAA (ATPases Associated with a wide variety of cellular Activities) protein (either replication factor C small subunit, or DNA polymerase III gamma subunit), provided as an example in the Phylogeny.fr [86] data window (viruses excluded): 10 eukaryotes considered as ingroup taxa, and 9 prokaryotes (4 Eubacteria and 5 Archaea) taken as outgroups. The species considered are (followed by accession number in the Entrez database: http://www.ncbi.nlm.nih.gov/sites/entrez?dbprotein): Plasmodium chabaudi (XP_745209), Trypanosoma brucei (XP_829019), Dictyostelium discoideum (XP_629875), Schizosaccharomyces pombe (NP_593121), Ustilago maydis (XP_756876), Arabidopsis thaliana (NP_176504), Caenorhabditis elegans (NP_500069), Anopheles gambiae (XM_308395.4), Strongylocentrotus purpuratus (XP_790650), Homo sapiens (NP_002905), Aquifex aeolicus (NP_214275), Polaribacter irgensii (ZP_01118896), Ehrlichia ruminantium (YP_196867), Neisseria meningitidis (NP_284372), Methanosarcina acetivorans (NP_615630), Haloarcula marismortui (YP_137064), Halobacterium species NRC-1 (NP_280914), Methanosphaera stadtmanae (YP_447457), and Methanospirillum hungatei (YP_502463). A total of 100 combinations of ingroup and outgroup taxa were selected, spanning all possible values of the ingroup/outgroup ratio. For each combination of taxa, an independent analysis was performed using the Phylogeny.fr platform (http://www.phylogeny.fr). Sequences were aligned with MUSCLE [87], and phylogenetic trees were estimated through four different methods: Bayesian approach using MrBayes (ver.
3.1.2) [88] with the GTR option for substitution types, and invariable and gamma rate variation across sites; maximum likelihood using PhyML (ver. 3.0 aLRT) [89], [90]; maximum parsimony as implemented in TNT (ver. 1.1) with sectorial search and tree fusing [91], [92]; and distance analysis using BIONJ [93]. The Bayesian analyses ran a Markov chain Monte Carlo for 10,000 generations, sampling a tree every 10 generations and discarding the first 250 trees sampled as burn-in. The other three methods involved 100 bootstrap replicates, yielding strict consensus trees. Nodes with support values below 50% were collapsed. The root was placed between the Archaea and the Eubacteria (or in rare cases the group formed by these and one archaeon). Values of A and C were calculated manually for each of the 400 resulting trees and for its ingroup set. The difference in the C/A ratio (taking logarithmic values) between the whole tree and the tree after deleting the outgroups is called log outgroup imbalance, and is a measure of the change in relative position within the tree space defined by these two variables (shown in Fig. 2). Thus, a positive value means that the whole tree occupies a steeper position in that tree space than its ingroup set, due to a positive contribution of the outgroups to tree imbalance. A negative value means a drop in relative position when outgroups are considered, meaning that outgroups actually decrease tree imbalance. The values of log outgroup imbalance were plotted against the relative proportion of outgroups in the dataset (Fig. 4). Linear regressions were calculated for the whole set of 400 trees, and separately for those obtained by each tree-building method. These regressions were compared pairwise through analyses of covariance.

In order to test whether the relationship found occurs in other datasets, a total of 61 published phylogenies (including the two already analyzed) were selected (Table 3) [94]–[125].
This is an explicitly eclectic selection of studies, based on the variety of my interests and readings. It is no more arbitrary than a random download from a database of phylogenetic trees, and no less rigorous than a well-posed query to it; in fact it is more reliable, because trees were selected only after scrutiny of the actual papers in which they were published. The species involved span a wide variety of eukaryotes, and the supraspecific ingroup taxa range from a single genus to a whole class. The data are only nucleic acid sequences, and the period of publication is the last 11 years. Most papers provided one tree, although in several instances the same dataset was analyzed with different methods, and different methods were sometimes applied to different datasets. Thus, every tree sampled is taken as independent. Only species-level taxa were considered; thus, whenever populations belonging to the same (sub)species represented different branches, these were united. Nodes with support values below 50% were collapsed. In all cases, the outgroups were those actually included in the analysis; this is often clear in the illustrated phylogenetic trees, but in a few papers it is only evident in the text. Values of A and C were calculated manually for each of these published trees and for its ingroup set, and log outgroup imbalance was plotted against the relative proportion of outgroups in the dataset (Fig. 5). The fit of all these log outgroup imbalance values to a normal distribution was tested with the Anderson-Darling goodness-of-fit statistic. Linear and quadratic regressions were calculated for the whole set of published trees, as well as for subsets of trees obtained with different methods.
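
The collapsing of poorly supported nodes can be sketched as follows. This is an illustrative assumption, not the procedure actually followed (the trees here were obtained through Phylogeny.fr or taken from published figures, and processed by hand). Each internal node is written as a hypothetical `(support, children)` pair, and any internal node other than the root whose support falls below the threshold is dissolved, so that its children attach directly to its parent and form a polytomy.

```python
def collapse(node, threshold=50.0):
    """Return a copy of `node` with weakly supported internal nodes dissolved.

    Hypothetical encoding: a terminal taxon is a string; an internal node is
    a (support, children) pair, with support in percent and children a list.
    The root passed in is always kept, whatever its support value.
    """
    if isinstance(node, str):
        return node
    support, children = node
    kept = []
    for child in children:
        child = collapse(child, threshold)       # clean the subtree first
        if isinstance(child, str) or child[0] >= threshold:
            kept.append(child)                   # well supported: keep as is
        else:
            kept.extend(child[1])                # dissolve: lift its children
    return (support, kept)


if __name__ == "__main__":
    tree = (100, [(40, ["a", "b"]), (90, ["c", (30, ["d", "e"])])])
    print(collapse(tree))
```

In this toy example the nodes with 40% and 30% support disappear, leaving the root with three descendants (a, b, and a three-taxon polytomy), which is how polytomies enter the A and C calculations.
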

Table 3

Published phylogenetic trees analyzed. Trees are ordered by method of inference (BA = Bayesian, ML = maximum likelihood, MP = maximum parsimony, NJ = distance), proportion of outgroups relative to ingroup taxa (out/in), and log outgroup imbalance (LOI). Values of A and C are given for the complete trees and for ingroup taxa only.

method	in	out	out/in	A all	C all	A in	C in	LOI	taxa	reference
BA	19	2	0.1053	39	299	35	221	1.3524	terrestrial pulmonates	[94]
BA	6	1	0.1667	13	55	11	41	0.5035	passerine birds	[95]
BA	85	28	0.3294	215	2457	161	1255	3.6329	lower neopterous insects	[96]
BA	3	1	0.3333	7	19	5	11	0.5143	terrestrial caenogastropods	[97]
BA	10	4	0.4000	26	115	18	72	0.4231	cichlid teleosts	[98]
BA	35	17	0.4857	100	994	68	543	1.9547	centaurine composites	[99]
BA	5	3	0.6000	13	45	8	19	1.0865	passerine birds	[100]
BA	34	25	0.7353	116	905	67	533	-0.1535	carnivore mammals	[101]
BA	34	28	0.8235	122	923	67	417	1.3417	pancrustaceans	[102]
BA	13	11	0.8462	45	310	24	110	2.3056	aquatic pulmonates	[103]
BA	25	34	1.3600	116	905	48	256	2.4684	carnivore mammals	[101]
BA	3	5	1.6667	14	39	5	11	0.5857	plethodontid salamanders	[104]
ML	17	1	0.0588	35	245	33	209	0.6667	mammals	[105]
ML	15	1	0.0667	30	113	29	111	-0.0609	terrestrial pulmonates	[106]
ML	12	1	0.0833	24	97	23	95	-0.0888	aquatic caenogastropods	[107]
ML	40	4	0.1000	82	681	74	438	2.3860	mammals	[108]
ML	14	2	0.1429	27	93	24	83	-0.0139	procellariiform birds	[109]
ML	54	8	0.1481	121	920	106	755	0.4807	pancrustaceans	[102]
ML	25	4	0.1600	51	425	43	229	3.0078	terrestrial pulmonates	[94]
ML	6	1	0.1667	13	55	11	41	0.5035	passerine birds	[95]
ML	50	9	0.1800	112	1200	95	838	1.8932	pectinid bivalves	[25]
ML	15	3	0.2000	28	105	23	71	0.6630	nemerteans	[110]
ML	10	4	0.4000	24	86	16	45	0.7708	cichlid teleosts	[98]
ML	5	3	0.6000	15	67	9	25	1.6889	passerine birds	[100]
ML	26	16	0.6154	72	485	44	256	0.9179	rodent mammals	[111]
ML	7	6	0.8571	21	83	13	53	-0.1245	perameloid marsupials	[112]
ML	3	7	2.3333	16	63	5	11	1.7375	insectivore mammals	[113]
MP	44	1	0.0227	89	735	87	645	0.8446	decapod crustaceans	[114]
MP	38	2	0.0526	78	337	75	257	0.8938	vetigastropods	[115]
MP	25	2	0.0800	48	324	45	274	0.6611	anguid lizards	[116]
MP	14	2	0.1429	30	163	27	155	-0.3074	procellariiform seabirds	[109]
MP	13	2	0.1538	27	133	24	104	0.5926	aquatic caenogastropods	[117]
MP	6	1	0.1667	13	55	11	41	0.5035	passerine birds	[95]
MP	23	4	0.1739	38	322	30	211	1.4404	conifers	[118]
MP	15	3	0.2000	27	103	22	69	0.6785	nemerteans	[110]
MP	14	4	0.2857	28	168	20	64	2.8000	pond turtles	[118]
MP	85	28	0.3294	214	3057	159	1657	3.8637	lower neopterous insects	[96]
MP	21	7	0.3333	40	184	31	125	0.5677	pond turtles	[119]
MP	10	4	0.4000	23	79	15	39	0.8348	cichlid teleosts	[98]
MP	10	4	0.4000	27	137	19	77	1.0214	passerine birds	[120]
MP	38	16	0.4211	87	569	61	327	1.1796	rodent mammals	[111]
MP	14	7	0.5000	41	297	27	125	2.6143	amphibians	[121]
MP	18	10	0.5556	50	265	31	139	0.8161	juglandaceans	[122]
MP	10	6	0.6000	31	145	19	73	0.8353	anseriform birds	[123]
MP	5	3	0.6000	13	46	8	19	1.1635	passerine birds	[100]
MP	7	5	0.7143	22	89	12	33	1.2955	unionoid bivalves	[124]
MP	34	25	0.7353	114	888	66	484	0.4561	carnivore mammals	[101]
MP	35	28	0.8000	123	1250	95	640	0.0629	arachnids	[24]
MP	13	11	0.8462	45	275	24	114	1.3611	freshwater pulmonates	[103]
MP	25	34	1.3600	114	888	47	290	1.6193	carnivore mammals	[101]
MP	3	5	1.6667	12	35	5	11	0.7167	plethodontid salamanders	[104]
MP	3	7	2.3333	18	82	5	11	2.3556	insectivore mammals	[113]
MP	5	16	3.2000	37	187	21	81	1.1969	passerine birds	[120]
NJ	13	2	0.1538	28	143	25	113	0.5871	aquatic caenogastropods	[117]
NJ	37	6	0.1622	79	512	68	405	0.5251	protochordates	[125]
NJ	6	1	0.1667	13	55	11	41	0.5035	passerine birds	[95]
NJ	10	4	0.4000	24	99	16	59	0.4375	cichlid teleosts	[98]
NJ	14	7	0.5000	41	285	27	113	2.7660	amphibians	[121]
NJ	5	3	0.6000	15	59	9	25	1.1556	passerine birds	[100]
NJ	13	11	0.8462	41	195	22	82	1.0288	aquatic pulmonates	[103]
NJ	3	7	2.3333	17	66	5	11	1.6824	insectivore mammals	[113]
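
For readers who wish to check the last column, the sketch below recomputes log outgroup imbalance from the A and C columns. It assumes, on the evidence of the tabulated numbers rather than of any formula stated in the text, that LOI is the plain difference between C/A for the whole tree and C/A for the ingroup set. The function name and argument names are hypothetical.

```python
def outgroup_imbalance(a_all, c_all, a_in, c_in):
    """C/A of the whole tree minus C/A of the ingroup set.

    Assumption: this plain difference of ratios is what Table 3 reports as
    LOI; it matches the tabulated values for the rows used below.
    """
    return c_all / a_all - c_in / a_in


# (method, taxa, A all, C all, A in, C in), copied from Table 3
rows = [
    ("BA", "terrestrial pulmonates", 39, 299, 35, 221),
    ("ML", "mammals [105]", 35, 245, 33, 209),
    ("NJ", "amphibians", 41, 285, 27, 113),
    ("BA", "carnivore mammals (34 ingroup taxa)", 116, 905, 67, 533),
]

for method, taxa, a_all, c_all, a_in, c_in in rows:
    loi = outgroup_imbalance(a_all, c_all, a_in, c_in)
    print(f"{method}  {taxa}: {loi:.4f}")
```

The first three rows return 1.3524, 0.6667 and 2.7660, as tabulated. For the fourth row the result is negative (about -0.1535, the tabulated magnitude), the case in which outgroups decrease tree imbalance.
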

Translation S1. Abstract translated into Catalan (0.03 MB DOC).


Fig. 1. Distribution of rooted, unlabeled trees in tree-shape space, defined by branch size (A) and cumulative branch size (C).

Fig. 2. Two analyzed phylogenetic trees, redrawn unlabeled and with uniform internodal distances.

Fig. 3. Relationship between branch size (A) and cumulative branch size (C) throughout two phylogenetic trees.

Fig. 4. Values of log outgroup imbalance plotted against the relative proportion of outgroups in the dataset of trees obtained applying four tree-building methods to 100 combinations of a set of outgroup and ingroup taxa.

Fig. 5. Values of log outgroup imbalance plotted against the relative proportion of outgroups in the dataset of 61 published phylogenetic trees.
