Measuring inferential importance of taxa using taxon influence indices.

Abstract

Assessing the importance of different taxa for inferring evolutionary history is a critical, but underutilized, aspect of systematics. Quantifying the importance of all taxa within a dataset provides an empirical measurement that can establish a ranking of extant taxa for ecological study and/or quantify the relative importance of newly announced or redescribed specimens to enable the disentangling of novelty and inferential influence. Here, we illustrate the use of taxon influence indices through analysis of both molecular and morphological datasets, introducing a modified Bayesian approach to the taxon influence index that accounts for model and topological uncertainty. Quantification of taxon influence using the Bayesian approach produced clear rankings for both dataset types. Bayesian taxon rankings differed from maximum likelihood (ML)-derived rankings from a mitogenomic dataset, and the highest ranking taxa exhibited the largest interquartile range in influence estimate, suggesting variance in the estimate must be taken into account when the ranking of taxa is the feature of interest. Application of the Bayesian taxon influence index to a recent morphological analysis of the Tully Monster (Tullimonstrum) reveals that it exhibits consistently low inferential importance across two recent treatments of the taxon with alternative character codings. These results lend support to the idea that taxon influence indices may be robust to character coding and therefore effective for morphological analyses. These results underscore a need for the development of approaches to, and application of, taxon influence analyses both for the purpose of establishing robust rankings for future inquiry and for explicitly quantifying the importance of individual taxa. 
Quantifying the importance of individual taxa refocuses debates in morphological studies from questions of character choice/significance and taxon sampling to explicitly analytical techniques, and guides discussion of the context of new discoveries.

Keywords: Bayesian; Tullimonstrum; taxon influence; taxon ranking; tree distance

Year: 2018 PMID: 29760889 PMCID: PMC5938459 DOI: 10.1002/ece3.3941

Source DB: PubMed Journal: Ecol Evol ISSN: 2045-7758 Impact factor: 2.912

INTRODUCTION

A fundamental question in systematics centers on understanding the importance of different taxa for understanding phylogenetic relationships. However, quantifying taxon importance has hinged on varying definitions of the term across many biological disciplines. In conservation biology and ecology, clades have traditionally been assigned values for “phylogenetic diversity [PD]” (Faith, 1992a,b) and taxa have been assigned estimates of “originality/evolutionary distinctiveness [ED]” (Pavoine, Ollier, & Dufour, 2005; Redding et al., 2008; and sources therein), both defined using combinations of character change reconstruction or branch lengths, and node counting across clades or between taxa of interest. Computational biology has built upon these definitions of importance and has cast importance in combinatorial terms employing PD and ED as measures in a constrained optimization problem (the “Noah's Ark Problem (NAP),” a subset of the knapsack problem) to solve for the amount of unique evolutionary history that can be preserved in a subset of taxa given assumptions on an amount of funding, and the relationship of funding allocated to probability of survival (Billionnet, 2013; Hartmann & Steel, 2006; Nee & May, 1997; Weitzman, 1998). In contrast to fixed‐tree approaches, in systematics, importance has been phrased in inferential terms, using sets of trees and either quartets or triplets (e.g., leaf and phylogenetic stability indices [Pol & Escapa, 2009; and sources therein]) or pruning to assess a taxon's effect on phylogenetic resolution (“wildcard” taxa [Nixon & Wheeler, 1992], “problematic” and “critical” taxa [Siddall, 1995], and later “rogue” and “unstable” taxa [Aberer, Krompass, & Stamatakis, 2013; Goloboff & Szumik, 2015]). 
Taxon importance measures emphasizing instability of taxa have been utilized predominantly to increase node support values by identifying and removing some subset of taxa from analyses, using various pipelines and optimality criteria (e.g., Aberer et al., 2013; Goloboff & Szumik, 2015). An alternative approach, suggested by Mariadassou, Bar‐Hen, and Kishino (2012), is instead a total‐taxa approach that assigns a value called taxon influence to all taxa within a dataset based on a leave‐one‐out taxon jackknifing and reinference procedure. This approach provides a relative measure to generate ranked lists of a full set of taxa, rather than acting as a cutoff method, like rogue taxon analysis, or on subtrees, like leaf or taxon stability indices. Because a taxon influence value is derived from independent reanalysis of the nearly complete original data compared to the full original data, it is a phylogenetic inference‐based reframing of a distinctiveness measure that is derived from a full analysis rather than partitioning of a single analysis. Additionally, the generality of taxon influence methods makes them applicable to many underassessed species, for which character data, either DNA or morphology, may be the only thing known (Mace, Gittleman, & Purvis, 2003). Furthermore, unlike ED/PD measures, taxon influence analyses do not require time‐calibrated phylogenies, which frequently necessitate a degree of knowledge of the fossil and/or biogeographic record unavailable for many groups of interest. Given this broad applicability and minimal assumptions, taxon influence approaches stand to potentially bridge the gap between definitions of importance in conservation and systematics by generating minimal‐assumption taxon rankings based on whole tree inference, which may subsequently guide the acquisition of data for clades of interest that lack the kind of information necessary for NAP approaches. 
Furthermore, such rank lists may be useful to track changes in character data as more analyses at phylogenomic (Bragg, Potter, Bi, & Moritz, 2016; Faircloth et al., 2012) and phenomic (e.g., Copes, Lucas, Thostenson, Hoekstra, & Boyer, 2016; Goswami, 2015; O'Leary & Kaufman, 2011) scales increase in size. Similarly, because taxon influence values are estimated for all taxa in a dataset, the relative position of a taxon of interest in the ranking of taxa may be useful for explicitly quantifying hypotheses of taxon importance implicit in many announcements of newly discovered or redescribed taxa. For example, in publications of new taxa based on phenomic data generated by tomographic methods, it remains a standard procedure to place these specimens using a parsimony analysis and to present character optimizations and contextualization of the new taxon based on its inferred position relative to other known groups on either an optimal or consensus topology (e.g., Giles, Friedman, & Brazeau, 2015; McCoy et al., 2016; Van Roy, Daley, & Briggs, 2015; Zhu et al., 2013). Such announcements are effectively verbal hypotheses of taxon importance. Despite this fact, existing inferential methods are insufficient for testing these hypotheses, because taxon importance is a relative measure that must account for both the importance of the other taxa and the effects of the characters used to infer the phylogeny. However, two problems exist with current taxon influence implementations. First, existing implementations are based on maximum likelihood, which infers a single optimized tree topology. Influence values for a taxon derived from trees estimated using ML are therefore based on a comparison of only two topologies that are assumed to be fixed estimates. 
These estimates thus critically neglect uncertainty—a value as important as the tree itself (Huelsenbeck & Rannala, 2004)—an omission which stands to significantly affect the inferred influence values and rankings generated by the taxon influence procedure. Second, existing taxon influence procedures discussed in Mariadassou et al. (2012) utilize either the Robinson‐Foulds metric (RF; Robinson and Foulds, 1981) or branch score difference (BSD; Kuhner and Felsenstein, 1994) to quantify differences between trees. Both values are derived from the computational literature and are agnostic to the issue of influential taxa. For example, the RF metric can produce maximal values for trivial rearrangements of a single taxon pair (Böcker, Canzar, & Klau, 2013; Lin, Rajan, & Moret, 2012), making it likely susceptible to the effects of rogue taxon behavior. The BSD, although accounting for both branch length and topological differences, is based on the RF metric and likely inherits this problem. Additionally, the interaction of differences in topology and branch lengths in the BSD may counteract one another in cases where short branch lengths and topological differences occur simultaneously (Kuhner & Felsenstein, 1994). A tree distance specific to questions of taxon influence remains an outstanding problem. To address these issues and to demonstrate the utility of taxon influence analysis for both robust ranking and taxon rank placement, we apply a modified version of the original taxon influence index (TII) approach of Mariadassou et al. (2012) to three published datasets: a complete mitogenomic dataset of reptiles (Jonniaux & Kumazawa, 2008), here referred to as JK2008, and two recently published datasets debating the placement of the unusual fossil taxon Tullimonstrum in a phylogenetic context (McCoy et al., 2016; Sallan et al., 2017). We account for tree uncertainty using a Bayesian approach to TII calculation discussed, but not implemented, by Mariadassou et al. 
(2012), and also present a novel tree distance to circumvent problems with the RF metric and BSD invoked in the original publication.
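The sensitivity of the RF metric to a single misplaced taxon can be illustrated with a small sketch. It assumes the rooted, clade-based form of the RF distance and represents trees as nested tuples of leaf names; the function names and the six-taxon example are hypothetical and are not taken from the authors' scripts.

```python
def clades(tree):
    """Non-trivial clades (frozensets of leaf names) of a rooted nested-tuple tree."""
    found = set()

    def leaf_set(node):
        if isinstance(node, str):          # a leaf
            return frozenset([node])
        below = frozenset().union(*(leaf_set(child) for child in node))
        found.add(below)                   # record the clade under this node
        return below

    found.discard(leaf_set(tree))          # the root clade is shared by all trees
    return found


def rf_distance(tree_a, tree_b):
    """Rooted RF distance: number of clades found in exactly one of the trees."""
    return len(clades(tree_a) ^ clades(tree_b))


def prune(tree, taxon):
    """Remove one leaf and collapse any node left with a single child."""
    if isinstance(tree, str):
        return None if tree == taxon else tree
    kept = [sub for sub in (prune(child, taxon) for child in tree) if sub is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)


# A caterpillar tree, and the same tree with taxon "A" moved to the other end.
full_tree = ((((("A", "B"), "C"), "D"), "E"), "F")
moved_tree = ((((("B", "C"), "D"), "E"), "F"), "A")

print(rf_distance(full_tree, moved_tree))                          # 8
print(rf_distance(prune(full_tree, "A"), prune(moved_tree, "A")))  # 0
```

Relocating one taxon changes every internal clade, so the distance reaches its maximum of 2(n − 2) = 8 for six taxa, although the two trees are identical once that taxon is pruned.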

METHODS

Phylogenetic analyses

Bayesian phylogenetic analyses were conducted in MrBayes v.3.2.6 (Ronquist et al., 2012). The JK2008 dataset was analyzed using the same model parameterization (GTR + I + Γ) as in Mariadassou et al. (2012). Analysis was run using a single chain of 10 million generations, with a 20% burn‐in. The Tullimonstrum datasets were analyzed using the Mkv + Γ model (Lewis, 2001) with six discrete classes, using a single chain of 20 million generations, with a 50% burn‐in. In both cases, the number of generations required to reach a sufficient topological ESS was determined by calculation of approximate ESS values in the R package rwty (Warren, Geneva, & Lanfear, 2017). Because the TII approach is a single taxon‐pruning procedure, all jackknifed analyses were assigned the parent number of generations.

Taxon influence measurement

The taxon influence index (TII), the expected distance between pairs of trees in the posterior distribution, was calculated according to Mariadassou et al. (2012). In that calculation, T* is the posterior distribution of trees from analysis using all taxa, T′ is a posterior distribution of trees in which a focal taxon is dropped before analysis, T′_i is a phylogenetic tree from a posterior T′ for which taxon i was dropped before analysis, T*_i is a tree from the posterior T* in which taxon i was dropped a posteriori for comparison with T′_i, w is the posterior probability of a tree, and d(·,·) is a topological distance between the two trees. The original calculation from Mariadassou et al. (2012) was modified in two ways. First, because fully comparing two high‐dimensional posterior distributions of trees through pairwise distances between their elements requires a number of summations that grows with n, the number of unique postburn‐in topologies, variance in the TII value due to a finite approximation with a smaller number of sums was estimated by resampling. For each iteration, a number of trees equal to min(|T*|,|T′|) were sampled without replacement from each posterior according to their posterior probabilities (w), and this sample was used to calculate the TII; this was repeated for 100 iterations. The estimated TII value for each taxon was the median of these resampled values. Second, given the potential issues with both the RF metric and BSD regarding influential taxa, informative distances between trees were defined as the ratio of the distance between the trees to the size of the shared tree. 
This new criterion was satisfied by a value referred to here as the SPR excess: an SPR distance, the minimum number of subtree‐pruning and regrafting rearrangements required to turn one tree into another (e.g., Goloboff, 2008), scaled by the number of taxa in the maximum agreement subtree (MAST; Gordon, 1979; Finden & Gordon, 1985; Valiente, 2009; see Ge, Wang, and Kim, 2005 for an example of the implications of deviation in tree shapes between a difference and similarity measure in the context of molecular data). Finally, for comparison to TII estimates, a rogue taxon analysis (Aberer, Pattengale, & Stamatakis, 2010; Aberer et al., 2013) using the Mkv + Γ model was conducted in RAxML v8.2.9 (Stamatakis, 2014). To standardize the comparison to a fixed set of trees, the postburn‐in distribution of trees from the Bayesian analysis, rather than a collection of bootstrap trees, was used. All TII calculations were conducted using scripts written by the authors (provided as files accompanying this paper) in the R environment (R Core Team 2016) using the ape (Paradis, Claude, & Strimmer, 2004), phangorn (Schliep, 2011), stringr (Wickham, 2015), and gespeR (Schmich et al., 2015) packages. Differences in taxon influence‐based rankings between the two Tullimonstrum datasets, and differences in rank by proportion of missing data, were calculated for this dataset using rank‐biased overlap (Webber, Moffat, & Zobel, 2010), for which significance was assessed using a permutation procedure against the null hypothesis of dissimilar rankings.
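The calculation described above can be sketched as follows. The display equation is not reproduced in this copy, so the sketch reads the index, per the prose definition, as a posterior‐weighted sum of distances over all pairs of trees, i.e. the sum of w(T*_i) · w(T′_i) · d(T*_i, T′_i). How each resampled set is turned into a TII value is not specified here; pairing the draws one‐to‐one and averaging is an assumption of this sketch. SPR distances and MAST sizes are taken as precomputed inputs, and all function names and toy values are hypothetical.

```python
import random
import statistics


def spr_excess(spr_distance, mast_size):
    """SPR distance scaled by the number of taxa in the maximum agreement subtree."""
    return spr_distance / mast_size


def expected_distance(full, pruned, dist):
    """Posterior-weighted distance over every pair of trees (assumed form of the TII)."""
    return sum(w_f * w_p * dist(t_f, t_p)
               for t_f, w_f in full.items()
               for t_p, w_p in pruned.items())


def draw(posterior, k, rng):
    """Draw k distinct trees; each draw is proportional to posterior probability."""
    pool = dict(posterior)
    picked = []
    for _ in range(k):
        trees = list(pool)
        tree = rng.choices(trees, weights=[pool[t] for t in trees], k=1)[0]
        picked.append(tree)
        del pool[tree]                     # without replacement
    return picked


def resampled_tii(full, pruned, dist, iterations=100, seed=0):
    """Median of resampled estimates (one-to-one pairing is an assumption)."""
    rng = random.Random(seed)
    k = min(len(full), len(pruned))
    estimates = []
    for _ in range(iterations):
        sample_full = draw(full, k, rng)
        sample_pruned = draw(pruned, k, rng)
        paired = zip(sample_full, sample_pruned)
        estimates.append(sum(dist(a, b) for a, b in paired) / k)
    return statistics.median(estimates), estimates


# Toy posteriors: tree label -> posterior probability (hypothetical values).
full = {"t1": 0.6, "t2": 0.3, "t3": 0.1}
pruned = {"u1": 0.5, "u2": 0.5}
toy = {("t1", "u1"): 0.0, ("t1", "u2"): 1.0, ("t2", "u1"): 1.0,
       ("t2", "u2"): 2.0, ("t3", "u1"): 3.0, ("t3", "u2"): 3.0}


def toy_distance(a, b):
    """Stand-in for a real tree distance such as the SPR excess."""
    return toy[(a, b)]


exact = expected_distance(full, pruned, toy_distance)
median_tii, estimates = resampled_tii(full, pruned, toy_distance)
```

In a real analysis the stand-in distance would be replaced by the SPR excess between two topologies, which requires dedicated SPR and MAST routines that are not shown here.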

RESULTS

Phylogenetic trees

Phylogenetic analysis of the Jonniaux and Kumazawa (JK2008) dataset demonstrated convergence in the postburn‐in tree topology (approxESST > 500) and ESS > 200 for all model parameters. The 50% majority‐rule consensus tree (Figure 1) revealed high clade support values throughout most of the tree, with low support values in the same locations as those inferred for bootstrap values by Mariadassou et al. (2012) for this dataset. The 50% majority‐rule consensus topology inferred under the Bayesian analysis was identical to that inferred by Mariadassou et al. (2012) under maximum likelihood, with the exception of the procedural collapse of the consensus tree for regions where clade support values were under 50%. There were 36 unique trees in the postburn‐in posterior distribution. The 99% credibility interval contained ten of these trees.
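A credible set of topologies such as the one reported above is commonly obtained by sorting unique topologies by posterior frequency and accumulating them until the chosen probability is reached. The sketch below assumes each sampled tree has already been reduced to a canonical topology label; the function name and the toy sample are hypothetical and do not reproduce the JK2008 posterior.

```python
from collections import Counter


def credible_set(sampled_topologies, level=0.99):
    """Smallest set of most-frequent topologies whose cumulative frequency reaches level."""
    counts = Counter(sampled_topologies)
    total = sum(counts.values())
    chosen, covered = [], 0
    for topology, count in counts.most_common():
        if covered / total >= level:
            break
        chosen.append(topology)
        covered += count
    return chosen, covered / total


# Toy postburn-in sample: 100 trees falling into four unique topologies.
sample = ["topoA"] * 90 + ["topoB"] * 8 + ["topoC"] + ["topoD"]
trees_95, mass_95 = credible_set(sample, level=0.95)
```

With this toy sample, the 95% set holds the two most frequent topologies, which together carry 98% of the sampled trees.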

Figure 1

Results of Bayesian analysis of the JK2008 dataset (50% majority‐rule consensus tree), with clade credibility values displayed at nodes. The topology was the same as that recovered by Mariadassou et al. (2012) after accounting for clade collapse procedures.

Phylogenetic analysis of the Tullimonstrum dataset demonstrated convergence in the postburn‐in tree topology (approxESST ~ 291), and ESS > 200 for all model parameters. The postburn‐in posterior distribution contained 9877 unique trees. The 50% majority‐rule consensus tree (Figure 2) differed significantly from the parsimony analysis of McCoy et al. (2016) in several ways. First, Metaspriggina was recovered as sister to the remaining ingroup taxa, with high support (1.0). Second, Tunicata was recovered as diverging before Cephalochordata, with high support (1.0). Third, the locations of the polytomies within the tree were shifted. A clade of Haikuichthys + Myllokunmingia was recovered as sister to the remaining taxa, with low support (0.54). In the remaining taxa, Euconodonta, Gilpichthys, a clade of Myxinoidea + Myxinikela (support 0.71), and the remaining taxa were recovered in a central polytomy. Within the remaining taxa, Mayomyzon, Tullimonstrum, a clade of Priscomyzon, Pipiscius, Petromyzontida, and Mesomyzon (support 0.7) were recovered in a polytomy with the remaining taxa. The remaining taxa exhibited the same phylogenetic structure as in McCoy et al. (2016).

Figure 2

Results of Bayesian analysis of the McCoy et al. Tullimonstrum dataset (50% majority‐rule consensus tree), with clade credibility values displayed at nodes. Differences in topology between this analysis and the results of McCoy et al. are described in the text.

Taxon influence values

TII analysis of the JK2008 dataset produced mostly well‐separated median values, with a small number of downwardly directed outliers, an apparent negative relationship between taxon influence and the interquartile range of the TII estimate, and no apparent relationship between TII estimate and the skewness of the distribution (Figure 3).

Figure 3

Results of taxon influence analysis of the JK2008 dataset. Rankings exhibited well‐separated medians and non‐normally distributed estimate distributions, with an apparent relationship between interquartile range and median value and downward‐directed outliers.

The medians of the three highest ranking taxa (Pelomedusa subrufa, Sceloporus occidentalis, and Plestiodon egregius) were well separated both from each other and from the other taxa. These taxa differed from those ranked highest by Mariadassou et al. (2012) under maximum likelihood using the BSD (Shinisaurus crocodilus, Coleonyx variegatus, and Sceloporus occidentalis). Four of the eight ingroup taxa identified as influential by Mariadassou et al. (2012) were recovered in the top of the Bayesian ranking (Geocalamus acutus, Sceloporus occidentalis, Coleonyx variegatus, and Pelomedusa subrufa).

TII analysis of the Tullimonstrum datasets produced well‐separated values (Figure 4a,b, lower), with a small number of extreme and directionally biased outliers that comprised no more than 10% of each taxon's TII estimates (Figure 4a,b, upper). Both analyses placed Tullimonstrum in the lower quartile of taxon influence (median TII_McCoy = 3.38e−05; median TII_Sallan = 3.07e−05) among all 27 ingroup taxa. Estimated TII values were lower in the Sallan et al. (2017) dataset than in the McCoy et al. (2016) dataset, and rankings exhibited several differences in the middle and tail of the list. However, the null hypothesis of dissimilarity in the rankings was rejected (rbo = 0.932, p < .0001). Rogue taxon analysis did not identify any taxa to be pruned. The ranking of taxa based on proportion of missing values (“?”; Figure 5) was unrelated to the estimated TII‐based ranks (rbo = 0.189; p = .829).
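The rank‐biased overlap comparison reported above can be illustrated with a short sketch of the extrapolated form of the measure for two rankings of equal length. The persistence parameter p = 0.9, the shuffling‐based permutation test, and the taxon labels are assumptions made for illustration; the settings the authors used in gespeR are not stated here.

```python
import random


def rbo_ext(rank_a, rank_b, p=0.9):
    """Extrapolated rank-biased overlap for two duplicate-free rankings."""
    k = min(len(rank_a), len(rank_b))
    seen_a, seen_b = set(), set()
    overlap = 0            # size of the intersection of the two prefixes
    weighted = 0.0         # running sum of (overlap / depth) * p ** depth
    for depth in range(1, k + 1):
        a, b = rank_a[depth - 1], rank_b[depth - 1]
        if a == b:
            overlap += 1
        else:
            overlap += (a in seen_b) + (b in seen_a)
        seen_a.add(a)
        seen_b.add(b)
        weighted += (overlap / depth) * p ** depth
    return (overlap / k) * p ** k + ((1 - p) / p) * weighted


def permutation_p_value(rank_a, rank_b, p=0.9, n_perm=2000, seed=1):
    """Share of random reorderings of rank_b at least as similar to rank_a as observed."""
    rng = random.Random(seed)
    observed = rbo_ext(rank_a, rank_b, p)
    shuffled = list(rank_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if rbo_ext(rank_a, shuffled, p) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical rankings of eight taxa from two analyses.
ranking_one = ["tax1", "tax2", "tax3", "tax4", "tax5", "tax6", "tax7", "tax8"]
ranking_two = ["tax1", "tax2", "tax4", "tax3", "tax5", "tax6", "tax8", "tax7"]
observed_rbo, p_value = permutation_p_value(ranking_one, ranking_two)
```

Because the measure weights the head of the list more heavily than the tail, swaps among low‐ranked taxa reduce the overlap less than swaps among the highest ranked taxa.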

Figure 4

(a) Results of taxon influence analysis of the McCoy et al. dataset. Upper plot shows full range of estimates, including upwardly directed outliers, which comprised no more than 10% of TII estimates per taxon. Lower plot shows TII rankings excluding outliers, with well‐separated median values. Tullimonstrum falls out in the lower quartile of taxon influence for all ingroup taxa. (b) Results of taxon influence analysis of the Sallan et al. dataset, with different interpretations of eight Tullimonstrum character states. Upper plot and lower plot as in Figure 4a. Tullimonstrum falls out in the lower quartile of influence values.

Figure 5

Scatterplot of inferred taxon influence values versus the proportion of missing data in the taxon for the McCoy et al. dataset. There was no significant relationship between influence value and proportion of missing data (R² = .02; p = .19). Ranking based on the proportion of missing data was unrelated to the TII‐based ranking (rbo = 0.189; p = .829).

DISCUSSION

Inference of well‐separated TII values for two contrasting data types—molecular data and morphological data—and for differing degrees of phylogenetic signal suggests the Bayesian‐based approach presented here is robust and applicable for ranking taxa with different data properties. Additionally, the stability in rank location of a focal taxon (Tullimonstrum) using our approach suggests the method may be beneficial for contextualizing hypotheses of the importance of individual taxa using analytical rank results.

Molecular dataset

The difference in taxon ranks between the present analysis and the original ML analysis underscores the important distinction between the two methods. Although the two approaches exhibited some overlap in highly ranked taxa (Figure 3; Mariadassou et al., 2012; Figures 4 and 5), the interquartile ranges around the median TII estimates in the present analysis reveal that TII‐based taxon rankings are likely to be significantly influenced by topological uncertainty. This implication is supported by the apparent disconnect between the strongly peaked posterior distribution, suggesting strong phylogenetic signal, and the variably wide and skewed shapes of the TII distributions for each taxon in the JK2008 dataset. The importance of the shape of TII distributions is further underscored by the overlap in interquartile range of the three highest ranking taxa with the medians of those taxa identified as highest rank by Mariadassou et al. (2012). Alternatively, such distributions may reflect an analytical artifact, such as model choice. Mariadassou et al. (2012) observed some differences among TII estimates and rankings when different models were employed on an amino acid dataset. Although there was an apparent relationship between TII median and interquartile range (Figure 3), it is currently unclear whether this variation is itself a feature of taxa that may reflect some degree of rogue behavior, or whether it is a computational artifact that may change with the number of TII sampling replicates, or with MCMC search intensity or model choice. We presented 100 iterations as a starting value for resampling the TII estimates, but more may be necessary for certain datasets. 
However, given the potential of Bayesian methods for obviating model selection through procedures like reversible‐jump MCMC (Ronquist et al., 2012), which is not applicable in commonly used maximum likelihood phylogenetic inference programs, model choice may not affect Bayesian TII estimates and rankings as strongly.

Morphological dataset

The ranking of Tullimonstrum using methods like taxon influence is significant because it reframes the debate in the recent literature on the taxon (Clements et al., 2016; McCoy et al., 2016; Sallan et al., 2017) from a conceptual one of character choice/significance and taxon sampling to an explicitly analytical one of the inferential importance of the taxon relative to other taxa. Specifically, based on the present results (Figures 4 and 5), we conclude that, relative to the selected taxa and characters, Tullimonstrum does not have a significant effect on our inference of the shape of evolutionary history; it is not inferentially important relative to the dataset. The hypothesis that Tullimonstrum is important, implied in the original paper (McCoy et al., 2016), is by this measure rejected, a conclusion that is further supported by the robustness of the Tullimonstrum rank position to differences in the coding of eight disputed character states between the McCoy et al. and Sallan et al. datasets. This conclusion stands in contrast to the intuitive idea of importance as suggested by the many apparently unique features in the taxon, notably the proboscis and eyestalks, and its placement within lampreys in parsimony analysis (McCoy et al., 2016). This unexpected outcome reveals that a distinction must be made between novelty and inference when contextualizing new taxonomic discoveries or redescriptions. Taxon influence analysis makes this distinction possible by explicitly quantifying one element (inferential importance). Additionally, given that the taxon influence measure accounts for taxon and character sampling, the low rank inferred for Tullimonstrum may be an artifact of sampling design incurred by adding new taxa to existing character matrices (see, for example, [Davis, Finarelli, & Coates, 2012; Zhu et al., 2013; Giles et al., 2015] and [McCoy et al., 2016; Morris & Caron, 2014; Sansom, Freedman, Gabbott, Aldridge, & Purnell, 2010]). 
Future work may utilize taxon influence measures to address the idea of refinability in morphological character datasets.

Methodological implications and future directions

The bounds around the resampling results (Figures 3 and 4) suggest that the finite sum approximation utilized in this study generates reproducible rankings of taxon influence and may thus be an effective approximation for calculating taxon influence based on the SPR excess distance measure, from posterior distributions of trees for which the probabilities were calculated using the standard approach. The cause of the directional outliers in the studied datasets (Figures 3 and 4) is currently unclear, but they may be an artifact of either the number of finite sums or an interaction between the probabilities of trees and the SPR distances between them. Although we have focused on several standard parametric models for nucleotide substitution and morphological character transformation, other posterior distributions of trees are possible. It may be useful to, for example, explore the distribution of parsimony‐score‐ranked trees under the Bayesian approach using the TS97/no common mechanism model (Tuffley & Steel, 1997), for comparison with the results of parametric models, or as a heuristic for larger datasets. It may also prove worthwhile to calculate tree posteriors using information‐theoretic measures (Larget, 2013; Lewis et al., 2016) and nonuniform tree priors, which may reveal a more universal metric for taxon influence assessment. Finally, although our method assesses the influence of individual taxa using a leave‐one‐out jackknifing approach as an intuitive method for generating ranked lists of taxa based on what is essentially the “main effect” of each taxon, the contributions of higher‐order “interaction” effects, such as pairwise‐ or clade‐based influence, have yet to be addressed by the taxon influence approach. Approaches for estimating clade stability have been discussed by several authors, including Pol and Escapa (2009), for reduced positional congruence, and Gatesy (2000), for linked branch support. 
In these cases, analyses were conducted on complete‐taxon datasets and sets of most parsimonious trees, rather than via a taxon jackknifing approach. The theory for, and effect of, pairwise or higher‐order interactions on taxon influence values is currently unclear. Future work expanding the taxon influence method through a leave‐k‐out approach may be beneficial, although direct interpretation of the results of complex multi‐taxon interaction may be difficult.
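The practical cost of a leave‐k‐out extension can be gauged by counting the reduced datasets it would require, since each one implies a separate reinference. The sketch below is only an enumeration aid with hypothetical names; it uses the 27 ingroup taxa of the Tullimonstrum datasets as the example size and says nothing about how the resulting influence values should be combined.

```python
from itertools import combinations
from math import comb


def leave_k_out(taxa, k):
    """Yield (dropped, retained) taxon tuples for every subset of k dropped taxa."""
    for dropped in combinations(taxa, k):
        removed = set(dropped)
        retained = tuple(t for t in taxa if t not in removed)
        yield dropped, retained


# Number of reanalyses needed for 27 taxa at increasing k.
n_taxa = 27
runs_needed = {k: comb(n_taxa, k) for k in (1, 2, 3)}

# Small illustration with five hypothetical taxon labels.
example_pairs = list(leave_k_out(["a", "b", "c", "d", "e"], 2))
```

For 27 taxa the number of reanalyses grows from 27 at k = 1 to 351 at k = 2 and 2925 at k = 3, which is one reason exhaustive higher‐order designs may be impractical for large datasets.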

CONFLICT OF INTERESTS

We declare we have no competing interests.

AUTHOR CONTRIBUTIONS

J.S.S.D. conceived the study, wrote the original taxon influence code, and drafted the manuscript. E.W.G. wrote the tree parser, refined the code, ran cluster‐based analyses, and edited the manuscript. Both authors approved submission.

DATA ACCESSIBILITY

All scripts and functions for calculating taxon influence and associated analyses are provided as files accompanying this paper.

ETHICAL STATEMENT

The research complies with all national and international ethical requirements.

34 in total

1. A likelihood approach to estimating phylogeny from discrete morphological character data.

Authors: P O Lewis
Journal: Syst Biol Date: 2001 Nov-Dec Impact factor: 15.683

2. A metric for phylogenetic trees based on matching.

Authors: Yu Lin; Vaibhav Rajan; Bernard M E Moret
Journal: IEEE/ACM Trans Comput Biol Bioinform Date: 2012 Jul-Aug Impact factor: 3.710

3. Molecular phylogenetic and dating analyses using mitochondrial DNA sequences of eyelid geckos (Squamata: Eublepharidae).

Authors: Pierre Jonniaux; Yoshinori Kumazawa
Journal: Gene Date: 2007-10-05 Impact factor: 3.688