Alvin X Han, Edyth Parker, Frits Scholer, Sebastian Maurer-Stroh, Colin A Russell.
Abstract
Subspecies nomenclature systems of pathogens are increasingly based on sequence data. The use of phylogenetics to identify and differentiate between clusters of genetically similar pathogens is particularly prevalent in virology, from the nomenclature of human papillomaviruses to that of highly pathogenic avian influenza (HPAI) H5Nx viruses. These nomenclature systems rely on absolute genetic distance thresholds to define the maximum genetic divergence tolerated between viruses designated as closely related. However, the phylogenetic clustering methods used in these nomenclature systems are limited by the arbitrariness of setting intra- and intercluster diversity thresholds. The lack of a consensus ground truth to define well-delineated, meaningful phylogenetic subpopulations amplifies the difficulties in identifying an informative distance threshold. Consequently, phylogenetic clustering often becomes an exploratory, ad hoc exercise. Phylogenetic Clustering by Linear Integer Programming (PhyCLIP) was developed to provide a statistically principled phylogenetic clustering framework that negates the need for an arbitrarily defined distance threshold. Using the pairwise patristic distance distributions of an input phylogeny, PhyCLIP parameterizes the intra- and intercluster divergence limits as statistical bounds in an integer linear programming model, which is subsequently optimized to cluster as many sequences as possible. When applied to the hemagglutinin phylogeny of HPAI H5Nx viruses, PhyCLIP not only recapitulated the current WHO/OIE/FAO H5 nomenclature system but also delineated informative higher-resolution clusters that capture geographically distinct subpopulations of viruses. PhyCLIP is pathogen-agnostic and can be generalized to a wide variety of research questions concerning the identification of biologically informative clusters in pathogen phylogenies.
PhyCLIP is freely available at http://github.com/alvinxhan/PhyCLIP, last accessed March 15, 2019.
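The pairwise patristic distance mentioned in the abstract is the sum of branch lengths along the path joining two tips. The sketch below illustrates that quantity, and the mean pairwise patristic distance of a set of tips, on a toy tree. It is not taken from PhyCLIP; the tree, node names, branch lengths, and function names are all hypothetical.

```python
from itertools import combinations

# Toy rooted tree stored as child -> (parent, branch length).
# Node names and branch lengths are invented for illustration only.
PARENT = {
    "n1": ("root", 0.02),
    "n2": ("root", 0.05),
    "A": ("n1", 0.01),
    "B": ("n1", 0.03),
    "C": ("n2", 0.02),
    "D": ("n2", 0.04),
}


def path_to_root(node):
    """Map each ancestor of `node` (and `node` itself) to its distance from `node`."""
    distance = 0.0
    distances = {node: 0.0}
    while node in PARENT:
        parent, length = PARENT[node]
        distance += length
        distances[parent] = distance
        node = parent
    return distances


def patristic(a, b):
    """Sum of branch lengths on the path between tips `a` and `b`."""
    from_a, from_b = path_to_root(a), path_to_root(b)
    # With non-negative branch lengths, the most recent common ancestor
    # is the shared ancestor that minimizes the summed distance.
    return min(from_a[n] + from_b[n] for n in from_a if n in from_b)


def mean_pairwise_patristic(tips):
    """Mean patristic distance over all unordered pairs of tips."""
    pairs = list(combinations(sorted(tips), 2))
    return sum(patristic(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    print(patristic("A", "B"))
    print(mean_pairwise_patristic(["A", "B", "C", "D"]))
```

For real phylogenies, a tree library would normally replace the hand-written dictionary; the dictionary is used here only to keep the example self-contained.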
Keywords: influenza; molecular epidemiology; nomenclature; pathogen; phylogenetic clustering
Year: 2019 PMID: 30854550 PMCID: PMC6573476 DOI: 10.1093/molbev/msz053
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Fig. 1. Schematics of PhyCLIP workflow and inference. (A) Workflow of PhyCLIP. Apart from an appropriately rooted phylogenetic tree, users only need to provide , and FDR as the inputs for PhyCLIP. After determining the within-cluster WCL, PhyCLIP dissociates distantly related subtrees and outlying sequences that inflate the mean patristic distance () of ancestral subtrees. The ILP model is then implemented and optimized to assign cluster membership to as many sequences as possible. If a prior of cluster membership is given, this is followed by a secondary optimization to retain as much of the prior membership as is statistically supportable within the limits of PhyCLIP. Post-ILP optimization clean-up steps are taken before yielding finalized clustering output. (B) PhyCLIP considers the phylogeny as an ensemble of monophyletic subtrees, each defined by an internal node (circled numbers) subtended by a set of sequences (letters encapsulated within shaded region of the same color as the circled number). In this example, only subtrees with sequences () are considered for clustering by the ILP model but is determined from of all subtrees, including the unshaded subtrees 6–8. Only subtrees where are eligible for clustering. (C) Subtrees and , as well as sequence j9 are dissociated from subtree as they are exceedingly distant from . If sequences , , and are clustered under subtree whereas is clustered under subtree by ILP optimization, a post-ILP clean up step will remove from cluster
Fig. 2. Influence of parameters on the clustering properties of PhyCLIP in the WHO/OIE/FAO 2015-update phylogeny. Figure A–F has the parameter set combinations ordered according to minimum cluster size, FDR and on the x-axis. The banded background and x-axis subscript numbering indicate the minimum cluster size of the parameter set. Marker color and size are indicative of the and the FDR respectively of the parameter set as indicated by the legend in figure B. (A) Total number of clusters. (B) Percentage of sequences clustered. (C) Grand mean of the pairwise patristic distance distribution. (D) Mean of the intercluster distance to all other clusters. (E) Mean within-cluster geographic distance calculated in Vincenty miles. (F) Mean within-cluster SD in collection dates.
Fig. 3. Phylogeny of the Clade 2.1x viruses circulating in Indonesia. The WHO/OIE/FAO H5 nomenclature is annotated in black. PhyCLIP’s cluster designation is indicated in blue, corresponding to tip color. PhyCLIP’s supercluster topology is exemplified by Cluster A. The source population of the supercluster is annotated as A in pink, with tips colored yellow. The divergent descendant clusters are annotated as A.1, A.2, and A.3 respectively here. The letter A here is shorthand for its nomenclature address, 1.4.1.5.5.4.2. This nomenclature address indicates that supercluster A is the second descendant of cluster 1.4.1.5.5.4 (indicated in light purple), which in turn is the fourth descendant of the source supercluster 1.4.1.5.5, indicated in red. See “Materials and Methods” section for full explanation of nomenclature addresses.
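The way the caption reads a nomenclature address can be sketched in a few lines. The example encodes only what the caption states: the final dot-separated component is the descendant index, and the remaining prefix is the address of the parent cluster. The function name `parse_address` is hypothetical and is not part of PhyCLIP; the full scheme is described in the paper's "Materials and Methods" section, which is not reproduced here.

```python
def parse_address(address):
    """Split a dotted nomenclature address into (parent address, descendant index).

    Assumption based on the figure caption only: the last component is the
    index of the descendant under the cluster named by the preceding prefix.
    A single-component address has no parent, so the parent is None.
    """
    parts = address.split(".")
    index = int(parts[-1])
    parent = ".".join(parts[:-1]) if len(parts) > 1 else None
    return parent, index


if __name__ == "__main__":
    # Supercluster A from the caption: second descendant of 1.4.1.5.5.4.
    print(parse_address("1.4.1.5.5.4.2"))
    # Its parent: fourth descendant of the source supercluster 1.4.1.5.5.
    print(parse_address("1.4.1.5.5.4"))
```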
Fig. 4. PhyCLIP’s delineation of WHO/OIE/FAO demarcated clades 2.3.2.1a (A) and 2.3.2.1c (B). Tips are colored according to PhyCLIP’s cluster designation. The tips colored in red in B are viruses that were designated as outliers by PhyCLIP’s outlier detection. Countries represented by single viruses in the cluster are indicated with an asterisk.
Benchmarking the Performance of PhyCLIP Against Widely Used Phylogenetic Clustering Tools.
| Approach | Time to Completion | Peak Memory Usage | Number of CPUs |
|---|---|---|---|
| PhyCLIP | 1 h 4 min | 2.0 GB | 8 |
| PhyCLIP | 3 h 25 min | 1.7 GB | 1 |
| ClusterPicker | 2.8 min | 0.3 GB | 1 |
| PhyloPart | 10.6 min | 4.1 GB | 8 |