Phylogenetics beyond biology.

Nancy Retzlaff, Peter F Stadler.

Abstract

Evolutionary processes have been described not only in biology but also for a wide range of human cultural activities including languages and law. In contrast to the evolution of DNA or protein sequences, the detailed mechanisms giving rise to the observed evolution-like processes are not or only partially known. The absence of a mechanistic model of evolution implies that it remains unknown how the distances between different taxa have to be quantified. Considering distortions of metric distances, we first show that poor choices of the distance measure can lead to incorrect phylogenetic trees. Based on the well-known fact that phylogenetic inference requires additive metrics, we then show that the correct phylogeny can be computed from a distance matrix $\mathbf{D}$ if there is a monotonic, subadditive function $\zeta$ such that $\zeta^{-1}(\mathbf{D})$ is additive. The required metric-preserving transformation $\zeta$ can be computed as the solution of an optimization problem. This result shows that the problem of phylogeny reconstruction is well defined even if a detailed mechanistic model of the evolutionary process remains elusive.

Keywords: Additive metric; Cultural evolution; Metric-preserving functions; Phylogenetic tree

Year: 2018; PMID: 29931521; PMCID: PMC6208858; DOI: 10.1007/s12064-018-0264-7

Source DB: PubMed; Journal: Theory Biosci; ISSN: 1431-7613; Impact factor: 1.919

Introduction

At the most abstract level, evolution can be seen as a consequence of the generation of variation and selection. Since selection acts to remove entities from the system, the system will eventually “die out” unless selection is counteracted by some form of reproduction. Sustained evolution thus necessarily operates on populations of entities. The history of an evolutionary process can be recorded in the form of a directed graph: Dress et al. (2010b) considered the set comprising “all organisms that ever lived on earth” arranged into a graph $G$ with arcs (directed edges) connecting two nodes $x$ and $y$ whenever $x$ was a “parent” of $y$, defined in a rather loose sense as having contributed directly to the genetic make-up of $y$. These arcs encode not only father and mother in sexually reproducing populations, but also horizontal gene transfer, hybridization, the incorporation of retroviruses into the genome, etc. Since arcs encode ancestry, $G$ is acyclic. The very same construction applies to many other systems that are perceived as evolutionary. For example, in the evolution of languages one may consider the mutual influences of speakers or, even more fine grained, individual utterances as the basic entities (Croft 2000; Pagel 2009). The same is true for the transmission of cultural techniques, designs, and conventions (Mesoudi et al. 2006). Well-studied cases include the transmission of texts (Greg 1950), in particular manuscripts, and text reuse, i.e., the borrowing of parts of a corpus, with or without modifications, in the process of creating a new text, see, e.g., Seo and Croft (2008). Similarly, the revisions of the law as dissenting interpretations can be seen in this manner (Roe 1996). The common ground of these and presumably many other systems is that a limited set of entities at some point or interval in time “informs” limited sets of entities in their (usually immediate) future.

The key result of Dress et al. (2010b) is that several types of clusters on the subset of organisms that are currently alive can be defined from the structure of the graph $G$. Many of these form hierarchies and therefore define a tree. These clusters naturally take on the role of taxa, and the corresponding trees consequently are meaningful representations of the phylogenetic relationships among these taxa. The same interpretation is meaningful, as we argued above, also for many (but presumably not all) aspects of human cultural endeavors. Notions of cultural evolution (see, e.g., Flannery (1972), Mesoudi et al. (2006)) are therefore more than a convenient metaphor. Instead, for a given system of interest, one has to ask whether or not the corresponding graph shares key features with the one obtained from conceptualizing biological evolution. There is no a priori reason to assume, for instance, that $G$ always gives rise to the tree-like abstraction that is at the heart of biological evolution. This is an inherently empirical question that needs to be answered for each “evolutionary” system under consideration. Human languages, for instance, are a prime example of an aspect of human activity that closely conforms to biological evolution. The key point here is that a phylogenetic structure is an emergent phenomenon of the underlying evolutionary process; it requires that there exists a level of aggregation in $G$ that produces clusters adhering to an (essentially) hierarchical structure. Although Dress et al. (2010b) provide a formal justification for phylogenetic reconstruction with their analysis of the graph $G$, their work does not attempt to provide a practical procedure to identify the relevant clusters, i.e., the taxa. After all, these are defined in terms of the graph $G$, which of course is not directly observable. In fact, usually not even the set of extant entities will be known completely, as we will have to be content with a subset of available data.
In general, neither the “true nature” of the elementary entities nor a complete description for each of them is available to us. Instead, we have to be content with measured representations. For instance, in molecular phylogenetics, it is customary to represent a taxon by a set of sequences (usually representing single copy protein coding genes) obtained from one or more individuals. Morphological approaches in phylogenetics use a list of characters such as features of bones or organs to represent a typical individual. The impact of the choice of representation on the results of phylogenetic reconstructions has long been recognized in morphological phylogenetics and has been the subject of a long-standing debate, see, e.g., Wiens (2001). The fundamental assumption that is made in any type of similarity-based phylogenetic analysis is that similarity of representations reflects evolutionary relatedness, i.e., proximity in $G$, and therefore also makes it possible to identify the hierarchical cluster systems that are defined in terms of $G$. This is well established, of course, in the case of molecular phylogenetics, where a detailed model of sequence evolution is available (Jukes and Cantor 1969; Tavaré 1986; Arenas 2015). Similarly, permutation distances directly count genomic rearrangement events (Hannenhalli and Pevzner 1995). The connection is much less clear for morphological phylogenetics, where the choice and even the concept of “character” is under debate, see, e.g., Wagner (2001), Wagner and Stadler (2003) for a formal discussion. In many cases, it seems difficult to construct a theory that links distance or similarity measures directly to an underlying evolutionary process. This is the case for instance in phylogenetic applications of distances between RNA secondary structures (Siebert and Backofen 2005) or the use of distance measures based on data compression (Cilibrasi and Vitanyi 2005; RajaRajeswari and Viswanadha Raju 2017).
Phylogenetic methods have also been employed in the humanities. Relationships among languages, for instance, can be captured by using cognates (i.e., words with a common origin) as characters, see, e.g., Gray et al. (2011), Holman and Wichmann (2017). Recently, sophisticated statistical approaches that model, e.g., the importance of sound change have been used to reconstruct language trees, see, e.g., Bhattacharya et al. (2018) for a recent overview. In stemmatics, differences between editions or manuscripts serve as characters from which the relationships between the many different versions can be reconstructed (O’Hara and Robinson 1993; Barbrook et al. 1998; Marmerola et al. 2016). Occasionally, material artefacts are considered. Tëmkin and Eldredge (2007) used phylogenetic methods to study the history of certain musical instruments. A broader perspective of phylogenetic approaches in cultural evolution is discussed, e.g., by Mesoudi et al. (2006), Steele et al. (2010) or Howe and Windram (2011).

It is a well-known fact in sequence analysis that not all (reasonable) distance measures lead to faithful reconstructions of phylogenies. It is a well-established practice, in fact, to correct for back-mutations, that is, to transform raw counts of diverged sequence positions (the Hamming or Levenshtein distances) into distance measures that can be interpreted as numbers of evolutionary events or divergence times. Depending on the level of insight into the data, the simple Jukes–Cantor model (Jukes and Cantor 1969) or one of the many much more elaborate models (Tavaré 1986; Arenas 2015) is used for this purpose. In the field of alignment-free sequence analysis, on the other hand, the focus is on the efficient computation of dissimilarity measures, without overt concern for the measure’s connection to a dynamical model of evolution (Vinga and Almeida 2003).
It has been observed, however, that the distance measures that do well in a phylogenetic context also correlate very well with model-based distances (Edgar 2004; Haubold et al. 2009; Leimeister and Morgenstern 2014). We suspect that this reflects the fact that a particular subclass of metrics, the so-called additive metrics, conveys complete phylogenetic information, see “Distance-based phylogenetics” section. We therefore make a strong assumption throughout this contribution:

Assumption A

Given a complete and correct model of the evolutionary dynamics on a suitably constructed space $X$, there is an additive metric distance measure $t$ on $X$ that measures the cumulative change along each lineage.

An immediate consequence is that phylogenetic relationships can be reconstructed unambiguously if $t$ is known. There is, of course, no reason to think that Assumption A holds in real life. In particular, it is certainly violated by all processes that lead to reticulate patterns in evolution, such as incomplete lineage sorting, horizontal gene transfer, and hybridization (Gontier 2015). The purpose of this contribution, therefore, is to ask how much (or how little) we need to know about the “true” metric $t$ to be able to infer the correct phylogenetic tree $T$. More precisely, we investigate here the consequence of distorted distance measurements: Suppose that instead of $t$ we can infer from the data only a “deformed” dissimilarity measure $d = \zeta(t)$, where $\zeta$ is an unknown function about which only some qualitative features can be known. We then ask: How much information about $t$, and thus the underlying phylogenetic tree, does $d$ still convey?

Distance-based phylogenetics

A map $d: X\times X\to\mathbb{R}^+$ is a metric if it satisfies, for all $x,y,z\in X$: (M0) $d(x,x)=0$; (M1) if $d(x,y)=0$ then $x=y$; (M2) $d(x,y)=d(y,x)$; (M3) $d(x,z)\le d(x,y)+d(y,z)$. Distance measures can be used for clustering and thus serve as a means of extracting hierarchical, i.e., tree-like, structures on a set of data.

The basis of distance-based phylogenetic methods is additive metrics, i.e., metrics that are representations of edge-weighted trees. Consider a tree $T$ with leaf-set $X$ and a length function $\ell: E(T)\to\mathbb{R}^+$ defined on the edges of $T$. Recall that every pair of leaves $x$ and $y$ is connected by a unique path in $T$. The length of this path, i.e., the sum of its edge lengths, defines the distance $d_T(x,y)$. Additive metrics are those that derive from a tree in this manner. A famous theorem (Buneman 1974; Cunningham 1978; Dobson 1974; Simões-Pereira 1969) shows that additive metrics are characterized by the four-point condition: A metric $d$ is additive if and only if for any four points $u,v,x,y\in X$

$d(u,v)+d(x,y)\le\max\{d(u,x)+d(v,y),\; d(u,y)+d(v,x)\}$

holds. The appearance of additive metrics in evolutionary processes can be justified rigorously for specific models. For example, Markovian processes on strings of fixed length lead to distances that can be estimated directly from the data: Denote by $f_{xy}(a,b)$ the fraction of characters in which $x$ has state $a$ and $y$ has state $b$; for each pair $(x,y)$ these fractions can be arranged in a matrix $F_{xy}$. Steel (1994) showed that (the expected values of) $-\ln|\det F_{xy}|$ form an additive metric. Well-known results from phylogenetic combinatorics show that given an additive metric, the tree and its edge lengths can be reconstructed readily, see, e.g., the work of Apresjan (1966), Imrich and Stockiĭ (1972), Buneman (1974), Dress (1984), Bandelt and Dress (1992), Dress et al. (2010a). The well-known neighbor-joining algorithm (Saitou and Nei 1987), a special case of a large class of agglomerative clustering algorithms, furthermore solves this problem efficiently and was shown to always compute the correct tree when presented with an additive metric, see the survey by Gascuel and Steel (2006) and the references therein.
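As an illustration of the four-point condition, the following minimal Python sketch computes the path-length metric of a small edge-weighted tree and tests additivity. The tree, its edge lengths, and the helper names `tree_metric` and `is_additive` are hypothetical choices made for this example only; they are not taken from the original work.

```python
from itertools import combinations

def tree_metric(edges, leaves):
    """Path-length distances between the leaves of an edge-weighted tree.

    edges maps an unordered node pair (u, v) to a positive edge length.
    """
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist = {}
    for src in leaves:
        # Depth-first walk; in a tree every node is reached by a unique path.
        stack, seen = [(src, 0.0)], {src}
        while stack:
            node, d = stack.pop()
            if node in leaves:
                dist[src, node] = d
            for nxt, w in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, d + w))
    return dist

def is_additive(dist, leaves, tol=1e-9):
    """Four-point condition: the two largest of the three pair sums coincide."""
    for u, v, x, y in combinations(leaves, 4):
        s = sorted([dist[u, v] + dist[x, y],
                    dist[u, x] + dist[v, y],
                    dist[u, y] + dist[v, x]])
        if s[2] - s[1] > tol:
            return False
    return True

# Hypothetical five-leaf tree with inner nodes p, q, r.
edges = {("a", "p"): 1.0, ("b", "p"): 2.0, ("p", "q"): 1.5, ("c", "q"): 1.0,
         ("q", "r"): 0.5, ("d", "r"): 2.0, ("e", "r"): 3.0}
leaves = ["a", "b", "c", "d", "e"]
d_tree = tree_metric(edges, leaves)

# Lengthening a single distance breaks the four-point condition.
d_bad = dict(d_tree)
d_bad["a", "c"] = d_bad["c", "a"] = d_tree["a", "c"] + 1.0
```

Here `is_additive(d_tree, leaves)` holds because the distances derive from a tree, whereas the perturbed `d_bad` fails the test for the quadruple a, b, c, d.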
Additivity of the underlying metric is also assumed in a recent generalization of phylogenetic trees that allows data points to appear not only as leaves but also as interior vertices of the reconstructed tree (Telles et al. 2013). A stronger condition than additivity is ultrametricity, which is characterized by the strong triangle inequality

(MU) $d(x,z)\le\max\{d(x,y),\; d(y,z)\}$ for all $x,y,z\in X$.

Condition (MU) means that all triangles are “isosceles with a short base”, i.e., two sides of each triangle have equal length and the third one is not longer than these two. Ultrametrics appear in phylogenetics under the assumption of the strong clock hypothesis, i.e., constant evolutionary rates (Dress et al. 2007). Dating of the internal nodes (Britton et al. 2007) transforms an (additive) phylogeny into an ultrametric tree. Ultrametrics are a special case of additive metrics.

Real-life data sets, unfortunately, almost never satisfy the four-point condition. As a remedy, Sattath and Tversky (1977) and Fitch (1981) suggested considering a “split relation” on pairs of objects, often referred to as quadruples, defined by

$uv\,\|\,xy \iff d(u,v)+d(x,y) < \min\{d(u,x)+d(v,y),\; d(u,y)+d(v,x)\}$ (1)

The relation $\|$ has been studied extensively and, under certain additional conditions, can provide sufficient information for reconstructing phylogenetic trees (Bandelt and Dress 1986) or at least phylogenetic networks (Bandelt and Dress 1992; Grünewald et al. 2009). The approximation of a given metric by additive metrics or ultrametrics given some measure of the goodness of fit has also received quite a bit of attention (Farach et al. 1996; Agarwala et al. 1998; Apostolico et al. 2013). Here, we ask under which conditions distance data that may deviate from additivity in a systematic manner still yield a phylogenetically (more or less) correct relation $\|$. This is different from the inference problems mentioned above: Our task is not to minimize a uniform error functional but to deal with systematic distortions of the distance measurements.
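The two notions above can be made concrete with a short Python sketch. It assumes that the split relation selects the pairing of a quadruple with the strictly smallest distance sum; the function names and the two example matrices are hypothetical and serve only as an illustration.

```python
from itertools import combinations

def is_ultrametric(d, tol=1e-9):
    """Strong triangle inequality (MU): in every triangle the two longest sides are equal."""
    for x, y, z in combinations(range(len(d)), 3):
        a, b, c = sorted([d[x][y], d[y][z], d[x][z]])
        if c - b > tol:
            return False
    return True

def quartet_split(d, u, v, x, y):
    """Pairing with the strictly smallest distance sum, or None if the minimum is tied."""
    sums = {
        ((u, v), (x, y)): d[u][v] + d[x][y],
        ((u, x), (v, y)): d[u][x] + d[v][y],
        ((u, y), (v, x)): d[u][y] + d[v][x],
    }
    ranked = sorted(sums.items(), key=lambda kv: kv[1])
    return ranked[0][0] if ranked[0][1] < ranked[1][1] else None

# Clock-like tree ((0,1),(2,3)): ultrametric and additive.
U = [[0, 2, 6, 6],
     [2, 0, 6, 6],
     [6, 6, 0, 4],
     [6, 6, 4, 0]]

# Same topology with unequal leaf edges (2, 1, 2, 3) and inner edge 3:
# additive but not ultrametric.
A = [[0, 3, 7, 8],
     [3, 0, 6, 7],
     [7, 6, 0, 5],
     [8, 7, 5, 0]]
```

Both matrices yield the same split, which illustrates that the relation depends only on the tree topology as long as the metric is additive.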
In order to formalize the problem setting, we assume that the evolutionary process under consideration (operating on a space $X$) generates an additive metric $t: X\times X\to\mathbb{R}^+$. The catch is that we have no knowledge of $t$ and we cannot directly access $X$. We can, however, obtain partial knowledge from representations. That is, there is a function $\varphi: X\to Y$. The construction of the representation in $Y$ depends on our theory of what is important about the evolving system. In molecular phylogenetics, $Y$ may be chosen to be a space of sequences. In classical, morphology-based phylogenetics, the elements of $Y$ are character-based descriptions of animals; attempts to use molecular structures for phylogenetic purposes might use RNA secondary structures or labeled graph representations of protein 3D structures; a historical linguist might choose word lists or grammatical features. Once we have decided on representations, we can turn to measuring (dis)similarities between them. The concrete choice of a distance measure $d_Y$ of course again depends on the theoretical conception of the underlying evolutionary process. We can easily reinterpret $d_Y$ as a distance measure on $X$ by setting

$d(x,y) := d_Y(\varphi(x),\varphi(y))$.

It is easy to see that $d$ is a metric whenever $d_Y$ is a metric and $\varphi$ is injective, i.e., whenever our representation is good enough to distinguish objects in $X$. There is no a priori reason to make this assumption, however. Consider, for example, RNA secondary structures as a function of the primary sequences. This map is highly redundant (Schuster et al. 1994); for example, most tRNAs share the standard clover-leaf structure despite very different sequences and divergence times that pre-date the common ancestor of all extant life forms (Eigen et al. 1989); distances between secondary structures therefore do not reflect all evolutionary processes. Formally, $d$ is not a metric but only a pseudometric in this case: It does not satisfy axiom (M1) any longer. We will ignore this complication here and assume for simplicity that $d$ is a metric.
The metric $d$ is of interest for phylogenetic purposes if it quantifies evolutionary divergence in a meaningful way. That is, we are concerned with the information about the underlying additive metric $t$ that can be extracted from $d$. Without additional assumptions on the relationships between $d$ and $t$, however, nothing much can be said. At the very least, our representation should be good enough to recognize whether one of two objects $x$ or $y$ has diverged further from a given reference point $z$ than the other. Hence, we assume that for all $x,y,z\in X$:

(m0) $t(x,z) < t(y,z)$ implies $d(x,z)\le d(y,z)$.

In the absence of at least this very weak form of monotonicity, we cannot really hope to recover information about $t$ from measuring $d$. To our knowledge, property (m0) has not received much attention in the past. The following, stronger condition, however, has been considered extensively:

(m1) $t(x,y) < t(u,v)$ implies $d(x,y)\le d(u,v)$

for all $u,v,x,y\in X$. This property is known as (strong) monotonicity (Kruskal 1964) and lies at the heart of non-metric multi-dimensional scaling, a set of techniques that aim at approximating dissimilarity data by a Euclidean metric (Borg and Groenen 2005). A commonly used criterion is to minimize the violations of condition (m1). It is interesting to note in this context that, given any input metric $d$, there is always a Euclidean metric that is connected with $d$ by strong monotonicity, provided the embedding space is of sufficiently high dimension (Agarwal et al. 2007). In our context, it will be interesting to investigate whether there is an analogous result for additive metrics. If we insist, in addition, that ties are preserved, i.e., that $t(x,y)=t(u,v)$ is equivalent to $d(x,y)=d(u,v)$, then there exists an increasing function $\zeta$ such that $d=\zeta(t)$. In the following, we will consider this (more restrictive) setting in some detail.
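The monotonicity and tie-preservation requirements can be checked directly on a pair of distance matrices. The sketch below is a minimal illustration under the assumption that (m1) reads "a strictly smaller $t$ never comes with a strictly larger $d$"; the function names and the three distorted matrices are hypothetical examples.

```python
import math
from itertools import combinations

def order_violations(t, d):
    """Pairs (p, q) of point pairs with t(p) < t(q) but d(p) > d(q)."""
    pairs = list(combinations(range(len(t)), 2))
    return [(p, q) for p in pairs for q in pairs
            if t[p[0]][p[1]] < t[q[0]][q[1]] and d[p[0]][p[1]] > d[q[0]][q[1]]]

def ties_preserved(t, d, tol=1e-12):
    """True if t(p) = t(q) holds exactly when d(p) = d(q)."""
    pairs = list(combinations(range(len(t)), 2))
    return all((abs(t[p[0]][p[1]] - t[q[0]][q[1]]) <= tol) ==
               (abs(d[p[0]][p[1]] - d[q[0]][q[1]]) <= tol)
               for p in pairs for q in pairs)

# Additive reference metric on four points.
t = [[0, 3, 7, 8],
     [3, 0, 6, 7],
     [7, 6, 0, 5],
     [8, 7, 5, 0]]

d_sqrt = [[math.sqrt(v) for v in row] for row in t]  # strictly increasing distortion
d_cap = [[min(v, 6) for v in row] for row in t]      # monotone, but creates new ties
d_swap = [row[:] for row in t]
d_swap[0][1] = d_swap[1][0] = 9                      # reverses the order of pair (0, 1)
```

Only `d_sqrt` satisfies both requirements and can therefore be written as an increasing function of `t`; `d_cap` is monotone but merges distinct distances, and `d_swap` violates the ordering outright.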

Metric-preserving functions

Definition 1

A function $\zeta:\mathbb{R}^+\to\mathbb{R}^+$ is metric-preserving if for every metric $d: X\times X\to\mathbb{R}^+$ the function $\zeta(d)$ is also a metric on $X$.

Consider the following properties: (Z1) $\zeta(x)=0$ if and only if $x=0$ (amenable); (Z2) $\zeta(x+y)\le\zeta(x)+\zeta(y)$ (subadditive); (Z3) $\zeta$ is non-decreasing. A theorem by Kelley (1955, p. 131) states that (Z1), (Z2), and (Z3) together are sufficient conditions for $\zeta$ to be metric-preserving. One can show, furthermore, that (Z1) and (Z2) are necessary (Corazza 1999). Property (Z3) is sufficient but not necessary, as shown by several examples of metric-preserving functions that fail to be non-decreasing (Doboš 1998; Corazza 1999). A necessary and sufficient condition (Wilson 1935; Borsik and Doboš 1981; Das 1989) is that $\zeta$ is amenable, (Z1), and satisfies

(Z*) $\zeta(z)\le\zeta(x)+\zeta(y)$ for all $x,y,z\ge 0$ with $|x-y|\le z\le x+y$.

It can also be shown that any concave amenable function is metric-preserving (Doboš 1998). If $d=\zeta(t)$ satisfies (m0), then (Z3) holds. We therefore restrict ourselves to amenable, subadditive, non-decreasing functions. Furthermore, we assume for convenience that $\zeta$ is continuous.

We say that $\zeta$ is a.m.-preserving (ultrametric-preserving) if $\zeta(t)$ is an additive metric (ultrametric) whenever $t$ is an additive metric (ultrametric). It was shown recently that a function preserves ultrametricity if and only if it is amenable (Z1) and non-decreasing (Z3) (Pongsriiam and Termwuttipong 2014). In Appendix, we prove:

Lemma 1

If $\zeta$ is a.m.-preserving, then it is also ultrametric-preserving.

This implies in particular that an a.m.-preserving function is non-decreasing. It will not come as a surprise that nonlinear distortions do not preserve additivity.

Theorem 1

If $\zeta$ is a.m.-preserving, then $\zeta(x)=cx$ holds for all $x\ge 0$ with a constant $c>0$.

A proof can be found in Appendix. The importance of this theorem lies in the fact that any nonlinear distortion of the metric $t$ necessarily destroys additivity and thus, depending on the algorithm employed, may result in the reconstruction of an incorrect phylogeny. Given the importance of the relation $\|$, it is natural to ask whether, or under what conditions, at least this relation is preserved. The example in Fig. 1 shows, however, that the relation $\|$ is not necessarily preserved under transformations satisfying (Z1), (Z2), and (Z3). The example of Fig. 1 is reminiscent of the effect of long branch attraction (LBA) in parsimony-based methods (Felsenstein 1978; Bergsten 2005), which can also be understood as the consequence of underestimating the impact of homoplasy, i.e., “back-mutations.”
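The effect described by Theorem 1 can be reproduced numerically. The following sketch is not the example of Fig. 1; it uses a hypothetical four-leaf tree with two long and two short leaf edges and, as an assumed distortion, the Jukes–Cantor saturation function for a four-letter alphabet. All names and branch lengths are illustrative choices.

```python
import math
from itertools import permutations

def jc(t):
    """Jukes-Cantor style saturation for a four-letter alphabet."""
    return 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))

def jc_inv(d):
    """Inverse of jc, defined for 0 <= d < 0.75."""
    return -0.75 * math.log(1.0 - 4.0 * d / 3.0)

def transform(f, d):
    return [[f(v) for v in row] for row in d]

def is_metric(d, tol=1e-12):
    n = len(d)
    zero_ok = all((d[i][j] == 0) == (i == j) for i in range(n) for j in range(n))
    sym_ok = all(abs(d[i][j] - d[j][i]) <= tol for i in range(n) for j in range(n))
    tri_ok = all(d[i][k] <= d[i][j] + d[j][k] + tol
                 for i, j, k in permutations(range(n), 3))
    return zero_ok and sym_ok and tri_ok

def pair_sums(d):
    """Distance sums of the three pairings of the points 0, 1, 2, 3."""
    return {"01|23": d[0][1] + d[2][3],
            "02|13": d[0][2] + d[1][3],
            "03|12": d[0][3] + d[1][2]}

def four_point_gap(d):
    """Difference of the two largest pair sums; 0 for an additive metric."""
    s = sorted(pair_sums(d).values())
    return s[2] - s[1]

def closest_pairing(d):
    sums = pair_sums(d)
    return min(sums, key=sums.get)

# Tree 01|23: leaves 0 and 2 sit on long edges (L), leaves 1 and 3 on
# short edges (s); the inner edge has length m.
L, s, m = 3.0, 0.1, 0.1
T = [[0, L + s, 2 * L + m, L + s + m],
     [L + s, 0, L + s + m, 2 * s + m],
     [2 * L + m, L + s + m, 0, L + s],
     [L + s + m, 2 * s + m, L + s, 0]]

D = transform(jc, T)             # saturated ("measured") distances
recovered = transform(jc_inv, D) # undoing the distortion
scaled = transform(lambda t: 2.5 * t, T)
```

In this example `D` is still a metric, but its smallest pair sum belongs to the pairing `02|13`, which groups the two long branches, while the undistorted `T` supports `01|23`. Applying `jc_inv` restores additivity, and the purely linear rescaling in `scaled` never destroys it.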

Fig. 1

Metric-preserving transformations do not preserve the relation $\|$. The distance matrix $\mathbf{T}$ corresponds to the tree in the middle and satisfies the split relation of Eq. (1) displayed by that tree. The function $\zeta$ satisfies (Z1), (Z2), (Z3) and is smooth. The transformed distance matrix $\zeta(\mathbf{T})$ is represented by the network shown on the r.h.s. (computed with SplitsTree (Huson and Bryant 2006)). Here, the pair of pairs with the shortest distance sum defines a quadruple that differs from the one supported by the tree. This split corresponds to the longer one of the two side lengths of the box.

Multiple features

A reasonable approach to devise a distance measure for a set of objects is to use a representation in terms of a collection of features, i.e., to consider a product space $Y=Y_1\times Y_2\times\dots\times Y_n$ with distance measures $d_{Y_i}$ independently defined for each of the features. Each feature can be seen as an independent representation, $\varphi_i: X\to Y_i$, and thus, we may reinterpret the $d_{Y_i}$ as different distance measures on $X$, i.e., $d_i(x,y) := d_{Y_i}(\varphi_i(x),\varphi_i(y))$ with $1\le i\le n$. In this setting, it seems most natural to assume that $d_i$ is just a pseudometric. It is well known that any nonnegative linear combination $d=\sum_i a_i d_i$ of pseudometrics with $a_i\ge 0$ is again a pseudometric. To avoid trivial cases, assume $a_i>0$. Then, $d$ is a metric whenever $x\ne y$ implies that there is a feature $i$ such that $d_i(x,y)>0$. The most general ways to combine metrics are given by the generalized metric-preserving transforms, i.e., functions $\zeta: (\mathbb{R}^+)^n\to\mathbb{R}^+$ with the property that $\zeta(d_1,\dots,d_n)$ is a metric whenever each $d_i$, $1\le i\le n$, is a metric (Das 1989). These functions have a characterization that naturally generalizes (Z1) and (Z*) to multiple arguments.

Theorem 2

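If $\zeta: (\mathbb{R}^+)^n\to\mathbb{R}^+$ transforms additive metrics $t_1,\dots,t_n$ consistent with the same underlying tree $T$ into a metric that is again compatible with $T$, then $\zeta(t_1,\dots,t_n)=\lambda(t_1,\dots,t_n)+\kappa(t_1,\dots,t_n)$ where (i) $\lambda(t_1,\dots,t_n)=\sum_i a_i t_i$ with $a_i\ge 0$, (ii) $\kappa$ is a nonnegative linear combination $\sum_i b_i\,\delta(t_i)$ where $\delta$ is the standard discrete metric applied to the $i$-th component, i.e., the $i$-th argument of $\zeta$, and (iii) for each $i$, at least one of $a_i$ and $b_i$ is nonzero.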

Proof

Suppose all component metrics are discrete except for one, $t_i$. Then, $\zeta$ is linear with nonnegative slope for $t_i$ as an immediate consequence of Theorem 1, i.e., condition (i) is necessary. Theorem 1 furthermore implies that the contribution for each feature $i$ is necessarily of the form $a_i t_i + b_i\,\delta(t_i)$ with $a_i, b_i\ge 0$. To ensure that we have a metric, each constituent must be a metric, i.e., at least one of $a_i$ and $b_i$ must be nonzero. □

In essence, Theorem 1 characterizes the distance measures that are “good” for phylogenetic purposes: These are exactly the ones that are linear combinations of distance measures that themselves are additive. In particular, therefore, alignment-free phylogenetic methods are guaranteed to work only when their distance measure approximates an additive measure, or, equivalently, when they approximate a distance for which a transformation to an additive distance is known (and used for the phylogenetic reconstruction).
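The role of the common underlying tree can be illustrated with a small Python sketch. It considers only the linear part of the combination (the discrete-metric terms are omitted), and the quartet metrics, weights, and function names are hypothetical choices for this example.

```python
def quartet_metric(pairs, leaf, inner):
    """Additive metric of the four-leaf tree whose inner edge separates
    pairs[0] from pairs[1]; leaf[i] is the length of the edge at leaf i."""
    (i, j), (k, l) = pairs
    side = {i: 0, j: 0, k: 1, l: 1}
    return [[0.0 if x == y else
             leaf[x] + leaf[y] + (inner if side[x] != side[y] else 0.0)
             for y in range(4)] for x in range(4)]

def four_point_gap(d):
    """Difference of the two largest pair sums; 0 for an additive metric."""
    s = sorted([d[0][1] + d[2][3], d[0][2] + d[1][3], d[0][3] + d[1][2]])
    return s[2] - s[1]

def combine(weights, metrics):
    """Nonnegative linear combination of distance matrices."""
    return [[sum(w * m[x][y] for w, m in zip(weights, metrics))
             for y in range(4)] for x in range(4)]

# Two features that evolved along the same tree 01|23 at different rates ...
t1 = quartet_metric(((0, 1), (2, 3)), [1.0, 2.0, 1.0, 3.0], 1.5)
t2 = quartet_metric(((0, 1), (2, 3)), [0.5, 0.5, 2.0, 1.0], 0.25)
# ... and one feature whose metric derives from the conflicting tree 02|13.
t3 = quartet_metric(((0, 2), (1, 3)), [1.0, 1.0, 1.0, 1.0], 1.0)

same_tree = combine([2.0, 0.5], [t1, t2])
mixed_trees = combine([1.0, 1.0], [t1, t3])
```

The combination `same_tree` satisfies the four-point condition exactly, whereas `mixed_trees` violates it by twice the shorter of the two conflicting inner edges, which is why the features must be consistent with one tree.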

Inferring transformations

The theoretical considerations above lead to the conclusion that the key problem for phylogenetic inference from data without a completely understood underlying model is to find monotonic transformations that make the original data as additive as possible before applying distance-based phylogenetic methods. It is important to realize that this is not the same problem as extracting the additive part of a given metric using, e.g., split decomposition. To see this, consider the transformed metric distance matrix $\mathbf{D}$ of Fig. 1. The inverse transformation $\zeta^{-1}$ recovers the additive metric of Fig. 1 (up to small rounding errors) and thus recovers the tree in Fig. 1. The split decomposition of $\mathbf{D}$, on the other hand, yields the network on the r.h.s. of the figure, whose two splits have different isolation indices. Any reasonable method for fitting an additive tree will thus pick up the quadruple corresponding to the dominant split from these distances.

Consider now a function that, given a metric distance matrix as input, produces a “best-fitting” additive metric distance matrix of the same dimension as output. More formally, denote by $\mathcal{M}_n$ the set of all metrics on $n$ points, and let $\pi:\mathcal{M}_n\to\mathcal{M}_n$.

Definition 2

A function $\pi:\mathcal{M}_n\to\mathcal{M}_n$ is a.m.-consistent if the following conditions are satisfied: (i) If $\mathbf{D}\in\mathcal{M}_n$ then $\pi(\mathbf{D})$ is an additive metric. (ii) If $\mathbf{D}$ is an additive metric, then $\pi(\mathbf{D})=\mathbf{D}$.

The neighbor-joining algorithm (Saitou and Nei 1987) is a well-known example of an a.m.-consistent function (Gascuel and Steel 2006). Another example is the non-prime part of the split decomposition (Bandelt and Dress 1992). Given a distance matrix $\mathbf{D}$ and an a.m.-consistent function $\pi$, a natural measure for the deviation from additivity is $\|\mathbf{D}-\pi(\mathbf{D})\|$ with some matrix norm $\|\cdot\|$. In particular, $\|\mathbf{D}-\pi(\mathbf{D})\|=0$ if and only if $\mathbf{D}$ is an additive metric. Let us now return to Assumption A and characterize distances that derive from additive metrics in a simple manner:

Lemma 2

Let $\mathbf{D}$ be a metric distance matrix, let $\pi$ be an a.m.-consistent function, suppose $\zeta$ is invertible, increasing, and subadditive, and let $\|\cdot\|$ be a matrix norm. Then, there is an additive distance matrix $\mathbf{T}$ with $\mathbf{D}=\zeta(\mathbf{T})$ if and only if $\|\mathbf{D}-\zeta(\pi(\zeta^{-1}(\mathbf{D})))\|=0$.

Invertibility of $\zeta$ implies that $\mathbf{D}=\zeta(\mathbf{T})$ is equivalent to $\mathbf{T}=\zeta^{-1}(\mathbf{D})$. Now $\pi(\zeta^{-1}(\mathbf{D}))=\zeta^{-1}(\mathbf{D})$ if and only if $\zeta^{-1}(\mathbf{D})$ is additive. Using invertibility of $\zeta$ again, this is in turn equivalent to $\zeta(\pi(\zeta^{-1}(\mathbf{D})))=\mathbf{D}$. Since the matrix norm vanishes only for the 0-matrix, the Lemma follows. □

Lemma 2 immediately suggests searching for $\zeta$ by minimizing the error functional

$\varepsilon(\zeta) := \|\mathbf{D}-\zeta(\pi(\zeta^{-1}(\mathbf{D})))\|$ (4)

By Lemma 2, $\mathbf{D}$ derives from an additive metric if and only if a $\zeta$ with $\varepsilon(\zeta)=0$ exists. Otherwise, we obtain an approximately additive source metric $\pi(\zeta^{-1}(\mathbf{D}))$ that then serves as the best available input for phylogenetic reconstruction. In this case, the values of $\varepsilon$ as well as the estimate of $\zeta$ that is found by minimizing $\varepsilon$ will in general depend on both the a.m.-consistent function $\pi$ and the matrix norm $\|\cdot\|$.

As a proof of principle, we first produced an artificial distance matrix by transforming the distances of a randomly generated tree with 100 leaves using the Jukes–Cantor rule (Jukes and Cantor 1969) corresponding to a four-letter alphabet and scaling the mutation rate such that back-mutations play a role but distances are not completely saturated. We then make the assumption that the measured data might depend on the unknown additive scale via a stretched exponential transformation of the form

$\zeta(t)=\alpha\,(1-\exp(-\beta t^{\gamma}))$ (5)

with unknown parameters $\alpha$, $\beta$, and $\gamma$. Figure 2 (top) shows that the correct values of $\alpha$ and $\gamma$ can be inferred by using Eq. (4) to minimize the discrepancy $\varepsilon$. In “Appendix 2,” we show more formally that the parameter $\beta$ is arbitrary and hence cannot be inferred. Intuitively, this follows from the fact that $\beta$ only scales the time axis and hence constitutes a purely additive transformation of the distance, which cancels in Eq. (4) because both $\zeta^{-1}$ and $\zeta$ are applied.

Fig. 2

Empirical estimation of a transformation $\zeta$. Top: The relevant parameters $\alpha$ and $\gamma$ of the stretched exponential transform Eq. (5) can be estimated with the help of Eq. (4). Plotting $\varepsilon$ as a function of the parameters $\alpha$ and $\gamma$ in Eq. (5) shows that the minimal discrepancy is indeed found at the theoretical values used to generate the transformed distance matrix corresponding to a tree with 100 leaves. The color scale on the r.h.s. of the panel refers to $\ln\varepsilon$. Below: The two small panels show the effect of increasing levels of measurement noise (left: lower noise level, right: higher noise level; see “Appendix 2” for details).

Real-life distance data of course are not perfectly additive. We therefore simulated sequence data by introducing substitutions independently at each sequence position according to a first order Markov process along all edges of a given phylogenetic tree. In order to tune the level of noise, we considered different linear combinations of the theoretical and the simulated data, see “Appendix 2” for details. We found that the estimation of $\zeta$ via Eq. (4) works well for small levels of sampling noise. For large noise levels, however, there are systematic biases. These appear to depend strongly on the choice of the matrix norm $\|\cdot\|$. Clearly, a better understanding of the numerical problems associated with this inference problem will be necessary before the conceptually simple workflow proposed here can be applied to real-life data.
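The inference idea can be sketched on a toy example. The code below is a simplified stand-in, not the procedure used for Fig. 2: instead of an a.m.-consistent function such as neighbor joining and a matrix norm, it scores a candidate transformation by a scale-invariant four-point violation of the back-transformed matrix. The stretched exponential family $\alpha(1-\exp(-\beta t^{\gamma}))$, the caterpillar tree, the parameter grid, and all function names are assumptions made for illustration.

```python
import math
from itertools import combinations

def caterpillar_metric(pendant, spine_pos):
    """Additive metric of a caterpillar tree: leaf i hangs at spine_pos[i]."""
    n = len(pendant)
    return [[0.0 if i == j else
             pendant[i] + pendant[j] + abs(spine_pos[i] - spine_pos[j])
             for j in range(n)] for i in range(n)]

def zeta(t, alpha, gamma, beta=1.0):
    """Assumed stretched exponential family."""
    return alpha * (1.0 - math.exp(-beta * t ** gamma))

def zeta_inv(d, alpha, gamma, beta=1.0):
    if d == 0.0:
        return 0.0
    return (-math.log1p(-d / alpha) / beta) ** (1.0 / gamma)

def mean_delta(d):
    """Average normalized four-point violation; 0 for an additive metric."""
    total, count = 0.0, 0
    for u, v, x, y in combinations(range(len(d)), 4):
        s1, s2, s3 = sorted([d[u][v] + d[x][y],
                             d[u][x] + d[v][y],
                             d[u][y] + d[v][x]])
        total += (s3 - s2) / (s3 - s1) if s3 > s1 else 0.0
        count += 1
    return total / count

def score(D, alpha, gamma):
    """Tree-likeness of the back-transformed matrix; beta is fixed to 1
    because it only rescales the additive metric."""
    if max(max(row) for row in D) >= alpha:
        return math.inf  # the inverse transformation is undefined
    return mean_delta([[zeta_inv(v, alpha, gamma) for v in row] for row in D])

# Synthetic data: an additive metric distorted with alpha = 3/4, gamma = 1.
T = caterpillar_metric([0.3, 0.5, 0.2, 0.6, 0.4, 0.35, 0.25],
                       [0.0, 0.3, 0.7, 1.0, 1.2, 1.6, 1.6])
D = [[zeta(v, 0.75, 1.0, beta=4.0 / 3.0) for v in row] for row in T]

grid = [(a, g) for a in (0.70, 0.75, 0.80, 0.90, 1.00)
               for g in (0.5, 0.75, 1.0, 1.25, 1.5)]
best = min(grid, key=lambda p: score(D, *p))
```

On this noise-free input the grid search returns the generating pair of parameters even though the data were produced with a different $\beta$, in line with the observation that the time-scale parameter cannot be inferred. With noisy data a proper optimizer and an a.m.-consistent function in the sense of Definition 2 would be needed.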

Discussion and conclusions

It has been realized already in the early days of computational phylogenetics that a suitable transformation of distance data, e.g., using the Jukes–Cantor transformation, can increase the additivity and thus conceivably improve the quality of phylogenetic reconstructions (Vach 1992). A main insight in this contribution is that it is, at least in principle, possible to infer the correct distance transformation from the measured data only. As a consequence, the correct inference of phylogenetic relationships is possible not only for additive distances but also for the large class of distances that arise from additive metrics with a monotonic metric-preserving function. At the same time, our results suggest that there are limits to phylogenetic inference. Whenever the available data cannot be transformed into an additive metric (at least approximately, i.e., up to measurement noise), there seems little hope to justify the interpretation of the results of hierarchical clustering (which of course can be performed on any kind of distance or similarity data) as a phylogeny. It is important to note, however, that our discussion has focused on metric-preserving functions, i.e., “uniform” transformations of the distance data. It is entirely possible to employ more general schemes that further extend the realm of phylogenetically meaningful data. For instance, the results of “Multiple features” section show that for data comprising multiple types of descriptors, distances extracted from the different subclasses $c$ can be transformed with different functions $\zeta_c$. Such an approach might even be useful to distinguish phylogenetically informative from problematic classes of features. On a more conceptual level, our results show that detailed mechanistic models of the underlying evolutionary process are not logically necessary for phylogenetic inference.
It is, in fact, sufficient that the measured distance data can be transformed to an additive metric by means of a monotonic metric-preserving function. This is not to say that a mechanistic understanding of the process is not useful or desirable. After all, a mechanistic model will, at the very least, typically imply the functional form of the transformation function $\zeta$. The inference of $\zeta$ from real-world data remains an important open problem. The issue to be explored is not only the limiting effect of measurement noise and inherent deviations from additivity due to horizontal gene transfer, incomplete lineage sorting, etc., but also numerical issues such as the fact that, in large trees, a substantial fraction of all pairwise distances takes values very close to the diameter of the tree. This seems to cause a particular susceptibility to measurement noise. Systematic simulation studies well beyond the scope of this contribution will be required to address these issues. A potential alternative to Eq. (4) is the minimization of some measure of tree-likeness for the transformed matrix $\zeta^{-1}(\mathbf{D})$. Attractive candidates are the corresponding parameters of statistical geometry (Eigen et al. 1988; Nieselt-Struwe 1997) and the related “$\delta$-plots” advocated by Holland et al. (2002). It is not obvious, however, how these measures react to the changes in scale invariably introduced by $\zeta^{-1}$. This issue does not arise in the context of Eq. (4) because the effects cancel due to the appearance of both $\zeta$ and $\zeta^{-1}$. It is interesting to note that our results also provide an a posteriori explanation for the observation that alignment-free methods work best in phylogenetic applications when the distances correlate well with alignment-based distances (Haubold et al. 2009; Morgenstern et al. 2017; Thankachan et al. 2017). It will be interesting to see whether other types of distances, such as compression distances (Kocsor et al. 2006; Penner et al. 2011), admit a transformation that makes them approximately additive.

Finally, several mathematical questions arise naturally from the results presented here. First, we may ask whether it is possible to replace condition (m1) by weaker requirements, such as (m0). Even more generally, to what extent can arbitrary rate variations be accommodated? We know of course that they are harmless in an underlying additive metric, but what is the most general distortion that can be accommodated? Complementarily, it will be of interest to characterize the functions that preserve circular (Kalmanson 1975) and weakly decomposable metrics (Bandelt and Dress 1992), respectively.

32 in total

1. Quasi-independence, homology and the unity of type: a topological theory of characters.

Authors: Günter P Wagner; Peter F Stadler
Journal: J Theor Biol Date: 2003-02-21 Impact factor: 2.691

2. Character analysis in morphological phylogenetics: problems and solutions.

Authors: J J Wiens
Journal: Syst Biol Date: 2001 Sep-Oct Impact factor: 15.683

3. Application of phylogenetic networks in evolutionary studies.

Authors: Daniel H Huson; David Bryant
Journal: Mol Biol Evol Date: 2005-10-12 Impact factor: 16.240

4. Application of compression-based distance measures to protein sequence classification: a methodological study.

Authors: András Kocsor; Attila Kertész-Farkas; László Kaján; Sándor Pongor
Journal: Bioinformatics Date: 2005-11-29 Impact factor: 6.937

Review 5. Neighbor-joining revealed.

Authors: Olivier Gascuel; Mike Steel
Journal: Mol Biol Evol Date: 2006-07-28 Impact factor: 16.240