Martin R Smith. Department of Earth Sciences, Lower Mountjoy, Durham University, Durham DH1 3LE, UK.
Abstract
MOTIVATION: The Robinson-Foulds (RF) metric is widely used by biologists, linguists and chemists to quantify similarity between pairs of phylogenetic trees. The measure tallies the number of bipartition splits that occur in both trees, but this conservative approach ignores potential similarities between almost-identical splits, with undesirable consequences. 'Generalized' RF metrics address this shortcoming by pairing splits in one tree with similar splits in the other. Each pair is assigned a similarity score, and the sum of these scores quantifies the similarity between the two trees. The challenge lies in quantifying split similarity: existing definitions lack a principled statistical underpinning, resulting in misleading tree distances that are difficult to interpret. Here, I propose probabilistic measures of split similarity, which allow tree similarity to be measured in natural units (bits). RESULTS: My new information-theoretic metrics outperform alternative measures of tree similarity when evaluated against a broad suite of criteria, even though they do not account for the non-independence of splits within a single tree. Mutual clustering information exhibits none of the undesirable properties that characterize other tree comparison metrics, and should be preferred to the RF metric. AVAILABILITY AND IMPLEMENTATION: The methods discussed in this article are implemented in the R package 'TreeDist', archived at https://dx.doi.org/10.5281/zenodo.3528123. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
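As a rough illustration of the contrast drawn in the abstract, the Python sketch below tallies shared splits in the RF manner and, separately, scores a single pair of splits in bits. It is not the TreeDist implementation: the function names, the representation of a tree as a list of leaf sets, and the use of plain mutual information between two bipartitions as the split-pair score are all assumptions made for this example.

```python
from math import log2


def canonical(split, leaves):
    """Represent a bipartition by the side that excludes the smallest leaf,
    so that each split has exactly one representation."""
    side = frozenset(split)
    anchor = min(leaves)
    return frozenset(leaves) - side if anchor in side else side


def rf_distance(splits_a, splits_b, leaves):
    """Robinson-Foulds distance: number of splits found in only one tree."""
    a = {canonical(s, leaves) for s in splits_a}
    b = {canonical(s, leaves) for s in splits_b}
    shared = len(a & b)
    return len(a) + len(b) - 2 * shared


def split_mutual_information(split_a, split_b, leaves):
    """Mutual information (bits) between two bipartitions of the same leaves.
    Assumption: used here as a stand-in for a probabilistic split-pair score."""
    everything = frozenset(leaves)
    n = len(everything)
    sides_a = (frozenset(split_a), everything - frozenset(split_a))
    sides_b = (frozenset(split_b), everything - frozenset(split_b))
    total = 0.0
    for a in sides_a:
        for b in sides_b:
            joint = len(a & b) / n
            if joint > 0:
                total += joint * log2(joint / ((len(a) / n) * (len(b) / n)))
    return total


# Hypothetical example: two 8-leaf caterpillar trees that differ only in
# the position of leaf A.
leaves = "ABCDEFGH"
tree1 = [set("AB"), set("ABC"), set("ABCD"), set("ABCDE"), set("ABCDEF")]
tree2 = [set("BC"), set("BCD"), set("BCDE"), set("BCDEF"), set("BCDEFG")]
```

For these two trees `rf_distance(tree1, tree2, leaves)` is 10, the maximum possible for binary trees on eight leaves, even though a single leaf has moved. A pairwise score such as `split_mutual_information(set("ABC"), set("BC"), leaves)` is still positive, which is the kind of partial similarity that generalized RF metrics are designed to capture.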