Leonid Chindelevitch, Maryam Hayati, Art F Y Poon, Caroline Colijn.
Abstract
The shape of phylogenetic trees can be used to gain evolutionary insights. A tree's shape specifies the connectivity of a tree, while its branch lengths reflect either the time or genetic distance between branching events; well-known measures of tree shape include the Colless and Sackin imbalance, which describe the asymmetry of a tree. In other contexts, network science has become an important paradigm for describing structural features of networks and using them to understand complex systems, ranging from protein interactions to social systems. Network science is thus a potential source of many novel ways to characterize tree shape, as trees are also networks. Here, we tailor tools from network science, including diameter, average path length, and betweenness, closeness, and eigenvector centrality, to summarize phylogenetic tree shapes. We thereby propose tree shape summaries that are complementary to both asymmetry and the frequencies of small configurations. These new statistics can be computed in linear time and scale well to describe the shapes of large trees. We apply these statistics, alongside some conventional tree statistics, to phylogenetic trees from three very different viruses (HIV, dengue fever and measles), from the same virus in different epidemiological scenarios (influenza A and HIV) and from simulation models known to produce trees with different shapes. Using mutual information and supervised learning algorithms, we find that the statistics adapted from network science perform as well as or better than conventional statistics. We describe their distributions and prove some basic results about their extreme values in a tree. We conclude that network science-based tree shape summaries are a promising addition to the toolkit of tree shape features. All our shape summaries, as well as functions to select the most discriminating ones for two sets of trees, are freely available as an R package at http://github.com/Leonardini/treeCentrality.
Year: 2021 PMID: 34941890 PMCID: PMC8699983 DOI: 10.1371/journal.pone.0259877
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. (a) An example tree. In the epidemiological context, tips (a − g) would correspond to pathogen sequences and internal nodes (A − F) to their inferred common ancestors. Node D subtends a “cherry” configuration, and node C subtends two cherries (a “double cherry”). The heights of the internal nodes are 1 (E, F), 2 (C, D), 4 (B) and 8 (A), so the diameter is 16 and the Wiener index is 484, for a mean path length of 6.21. (b) Same tree with betweenness centrality values at each node (note that branch lengths do not change them). The tree has betweenness centrality 45. (c) Same tree with farness (reciprocal of closeness centrality) values at each node. The tree has closeness centrality 1/48. (d) Same tree with eigenvector centrality values (scaled to have a minimum of 1) at each node, rounded to 3 significant figures. Here, the leading eigenvalue is λ = 9.05. By our definition, the tree has eigenvector centrality 714/1023 = 0.698 (here, 1023² is the sum of the squared values).
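The numbers quoted in the Fig 1 caption can be checked with a short Python sketch. The edge list below is an assumption: it is a reconstruction of the tree from the node heights and configurations given in the caption (the assignment of tip labels to cherries is hypothetical), and it is not taken from the treeCentrality package. The sketch also assumes that the tree-level betweenness is the maximum over nodes and that the tree-level closeness is the reciprocal of the minimum weighted farness, which is consistent with the values 45 and 1/48. For readability it uses a quadratic all-pairs traversal rather than the linear-time computation mentioned in the abstract.

```python
from itertools import combinations

# Hypothetical reconstruction of the Fig 1 tree: (child, parent, branch length).
# Branch lengths follow from the stated heights: A = 8, B = 4, C = D = 2, E = F = 1.
EDGES = [
    ("B", "A", 4), ("g", "A", 8),
    ("C", "B", 2), ("D", "B", 2),
    ("E", "C", 1), ("F", "C", 1),
    ("a", "E", 1), ("b", "E", 1),
    ("c", "F", 1), ("d", "F", 1),
    ("e", "D", 2), ("f", "D", 2),
]


def build_adjacency(edges):
    """Undirected weighted adjacency map: node -> {neighbour: length}."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj


def distances_from(adj, source):
    """Weighted path length from source to every node (paths in a tree are unique)."""
    dist = {source: 0}
    stack = [source]
    while stack:
        u = stack.pop()
        for v, w in adj[u].items():
            if v not in dist:
                dist[v] = dist[u] + w
                stack.append(v)
    return dist


def component_sizes(adj, removed):
    """Sizes of the connected components left after deleting one node."""
    seen = {removed}
    sizes = []
    for start in adj[removed]:
        stack, size = [start], 0
        seen.add(start)
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        sizes.append(size)
    return sizes


def betweenness(adj, node):
    """Number of node pairs whose connecting path passes through `node`."""
    return sum(x * y for x, y in combinations(component_sizes(adj, node), 2))


adj = build_adjacency(EDGES)
n = len(adj)
all_dist = {u: distances_from(adj, u) for u in adj}

diameter = max(max(d.values()) for d in all_dist.values())
wiener = sum(sum(d.values()) for d in all_dist.values()) // 2  # each pair counted twice
mean_path = wiener / (n * (n - 1) / 2)
max_betweenness = max(betweenness(adj, u) for u in adj)
min_farness = min(sum(d.values()) for d in all_dist.values())

print(diameter, wiener, round(mean_path, 2), max_betweenness, min_farness)
# prints: 16 484 6.21 45 48
```

On this reconstructed tree, both the maximum betweenness and the minimum farness are attained at node C.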
Table 1. Summary measures for phylogenetic trees.
Here, n_i is the number of nodes at depth i.
| Name | Description | Short form | Ref. |
|---|---|---|---|
| Numbers of small configurations | | | |
| Cherry number | # of nodes with 2 tip children | cherries | [ |
| Pitchforks | # of nodes with 3 tip descendants | pitchforks | [ |
| Double cherries | # of nodes with 2 cherry children | doubcherries | new |
| 4-caterpillar | # of caterpillars with 4 tips | fourprong | [ |
| Clades of size | # of nodes with | num | [ |
| Tree-wide summaries | | | |
| Colless imbalance | Colless imbalance | colless | [ |
| Sackin imbalance | Mean path length from tip to root | sackin | [ |
| Maximum height | Max # of steps from the root | maxheight | [ |
| Maximum width | Max # of nodes at the same depth | maxwidth | [ |
| Stairs | Proportion of imbalanced subtrees | stairs | [ |
| Max difference in widths | max | delW | [ |
| Node properties from network science | | | |
| Betweenness centrality | # of shortest paths through node | between | [ |
| Weighted betweenness | as above, but with weighted edges | betweenW | [ |
| Closeness centrality | total distance to all other nodes | closeness | [ |
| Weighted closeness | as above, but with weighted edges | closenessW | [ |
| Eigenvector centrality | value in Perron-Frobenius vector | eigen | [ |
| Weighted eigenvector | as above, but with weighted edges | eigenW | [ |
| Summaries from network science | | | |
| Diameter | largest distance between 2 nodes | diameter | [ |
| Mean pairwise distance | average distance between 2 nodes | meanpath | [ |
| Spectral properties | | | |
| Min adjacency | min adjacency matrix eigenvalue >0 | minAdj | [ |
| Max adjacency | max adjacency matrix eigenvalue | maxAdj | [ |
| Min Laplacian | min Laplacian matrix eigenvalue >0 | minLap | [ |
| Max Laplacian | max Laplacian matrix eigenvalue | maxLap | [ |
| Distance Laplacian spectral properties | | | |
| Max eigenvalue | largest eigenvalue in the spectrum | dLapLambdaMax | [ |
| Max density | location of largest spectral density | dLapDensityMax | [ |
| Asymmetry | skewness of the spectral density | dLapAsymmetry | [ |
| Kurtosis | peakedness of the spectral density | dLapKurtosis | [ |
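
As an illustration of the conventional statistics in the first two groups of the table, the following Python sketch computes a few of them on a small binary tree. The nested-tuple encoding and the example tree (the topology described in the Fig 1 caption, with a hypothetical assignment of tip labels) are assumptions made for this sketch, not the representation used by the treeCentrality package. The Sackin value follows the table's description (mean number of steps from tip to root), and the Colless value uses the common definition as the sum over internal nodes of the absolute difference in tip counts between the two child subtrees.

```python
# Hypothetical example tree as nested 2-tuples; strings are tips.
# Topology: root A = (B, g), B = (C, D), C = (E, F), with cherries E, F and D.
TREE = (((("a", "b"), ("c", "d")), ("e", "f")), "g")


def is_tip(node):
    return isinstance(node, str)


def n_tips(node):
    """Number of tips below (or at) a node."""
    return 1 if is_tip(node) else sum(n_tips(child) for child in node)


def internal_nodes(node):
    """Yield every internal node of the tree, root first."""
    if not is_tip(node):
        yield node
        for child in node:
            yield from internal_nodes(child)


def is_cherry(node):
    """An internal node whose two children are both tips."""
    return not is_tip(node) and all(is_tip(child) for child in node)


def tip_depths(node, depth=0):
    """Yield the number of steps from the root to each tip."""
    if is_tip(node):
        yield depth
    else:
        for child in node:
            yield from tip_depths(child, depth + 1)


nodes = list(internal_nodes(TREE))
depths = list(tip_depths(TREE))

cherries = sum(is_cherry(v) for v in nodes)
double_cherries = sum(all(is_cherry(c) for c in v) for v in nodes)
pitchforks = sum(n_tips(v) == 3 for v in nodes)
colless = sum(abs(n_tips(v[0]) - n_tips(v[1])) for v in nodes)
sackin_mean = sum(depths) / len(depths)
max_height = max(depths)

print(cherries, double_cherries, pitchforks, colless, max_height)
# prints: 3 1 0 7 4
```

Each statistic here is recomputed from scratch for clarity; a single post-order traversal that caches tip counts would avoid the repeated work on large trees.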
Fig 2. Tree summaries for the HIV/dengue/measles comparison, the influenza comparison, and simulations of size 100.
Fig 3. Tree summaries for HIV in three settings and simulations of size 300.
Fig 4. A randomly sampled tree from each scenario (except HIV in the 3-virus comparison because HIV is represented in three other trees).
To allow for focus on tree shape rather than on branch lengths, trees have been visualized with branch lengths set to 1.
Fig 5. Mutual information between tree summaries and the virus type or epidemiological scenario.
Labels are as in Table 1, and the colour indicates the type of statistic as per Table 1, i.e. red: basic statistics; green: distance Laplacian spectrum statistics; blue: network science-based statistics; purple: spectral statistics.
Fig 6. (a) Accuracy of classification with and without network science features, as well as with and without the distance Laplacian spectral features. (b) Feature importance in multi-class classification across all scenarios, ordered by the median. Here the importance for each feature is computed based on the mean decrease in node impurity (we use the varImp function in R, which uses the Gini index as an impurity measure). Each point is the rank of the corresponding feature in one of the classification tasks. Low ranks correspond to the most important features (i.e. the top-ranked feature has rank 1, and so on).