| Literature DB >> 34758743 |
Hassan W Kayondo1,2, Alfred Ssekagiri3,4, Grace Nabakooza4,5,6, Nicholas Bbosa7, Deogratius Ssemwanga3,7, Pontiano Kaleebu3,7, Samuel Mwalili8, John M Mango9, Andrew J Leigh Brown10, Roberto A Saenz11, Ronald Galiwango6, John M Kitayimbwa6.
Abstract
BACKGROUND: Host population structure is a key determinant of pathogen and infectious disease transmission patterns. Pathogen phylogenetic trees are useful tools for revealing the population structure underlying an epidemic. Determining whether a population is structured or not helps inform the choice of phylogenetic methods for a given study. We employ tree statistics derived from phylogenetic trees, together with machine learning classification techniques, to reveal an underlying population structure.
Keywords: Classification; Host population; Phylogenetic tree; Simulation; Structured; Tree statistics; non-structured
Year: 2021 PMID: 34758743 PMCID: PMC8579572 DOI: 10.1186/s12859-021-04465-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 A bifurcating phylogenetic tree with 6 tips, simulated using our simulation procedure. Node A is the first bifurcation event. Nodes B, C, D and E are internal nodes. Nodes numbered 1 to 6 are the tips (leaves). A migration event is represented from node B to node E. The phylogenetic tree is from a structured population; the two sub-populations are shown in green and red
Fig. 2 Box plots of tree statistics for dataset 1. Mean values are the red points inside the boxes, median values are the horizontal black lines inside the boxes, and outliers are the purple dots. The groups are structured (str) and non-structured (unstr). A normalised number of cherries, B normalised Colless index, C normalised Sackin index, D total cophenetic index, E ladder length, F maximum width, G maximum depth, H width-to-depth ratio
Fig. 3 Box plots of tree statistics for dataset 2. Mean values are the red points inside the boxes, median values are the horizontal black lines inside the boxes, and outliers are the purple dots. The groups are structured (str) and non-structured (unstr). A normalised number of cherries, B normalised Colless index, C normalised Sackin index, D total cophenetic index, E ladder length, F maximum width, G maximum depth, H width-to-depth ratio
Two-sample Kolmogorov-Smirnov, Cucconi and Podgor-Gastwirth tests comparing distributions of tree statistics between populations for dataset 1. D, C and S are the statistics used in the respective tests
| Tree statistics | Kolmogorov-Smirnov test (D) | Cucconi test (C) | Cucconi p-value | Podgor-Gastwirth test (S) | Podgor-Gastwirth p-value |
|---|---|---|---|---|---|
| Number of cherries | 0.142 | 4.997 | 0.007 | 4.951 | 0.0073 |
| Colless index | 0.43 | 110.7951 | | 142.0913 | |
| Sackin index | 0.408 | 100.971 | | 126.3 | |
| Total cophenetic index | 0.44 | 101.987 | | 127.896 | |
| Ladder length | 0.128 | 4.668 | 0.011 | 4.567 | 0.0106 |
| Maximum width | 0.362 | 62.15007 | | 79.63062 | |
| Maximum depth | 0.304 | 50.3996 | | 56.04892 | |
| Width-depth ratio | 0.184 | 28.19 | | 29.825 | |
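The Kolmogorov-Smirnov comparison in the table can be sketched with SciPy's `ks_2samp` (the Cucconi and Podgor-Gastwirth tests have no SciPy implementation). The sample values below are synthetic placeholders standing in for a tree statistic, not the paper's data:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Hypothetical tree-statistic samples, e.g. a normalised index measured on
# 250 structured and 250 non-structured trees (synthetic, illustrative only).
structured = rng.normal(loc=0.45, scale=0.05, size=250)
non_structured = rng.normal(loc=0.30, scale=0.05, size=250)

# D is the maximum vertical distance between the two empirical CDFs;
# a large D with a small p-value indicates the distributions differ.
result = ks_2samp(structured, non_structured)
print(f"D = {result.statistic:.3f}, p = {result.pvalue:.3g}")
```

With well-separated samples like these, D is large and the p-value is effectively zero, mirroring the clear separation reported for most statistics in the table.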
Results of 10-fold cross-validated classification, with averages computed for each measure, for the baseline and sensitivity analyses. Times (in seconds) taken to build the respective models are also shown
| Classifier (scenario) | Sensitivity | Specificity | AUC | Accuracy | Time (s) |
|---|---|---|---|---|---|
| Baseline | 0.99 | 0.95 | 0.98 | 0.97 | 1.23 |
| Varied tree size | 0.98 | 0.88 | 0.93 | 0.93 | 1.30 |
| Varied parameters | 0.93 | 0.86 | 0.90 | 0.90 | 1.25 |
| Varied tree size & parameters | 0.99 | 0.97 | 0.99 | 0.99 | 1.18 |
| Baseline | 0.99 | 0.99 | 0.99 | 0.99 | 16.60 |
| Varied tree size | 0.98 | 0.90 | 0.95 | 0.94 | 17.64 |
| Varied parameters | 0.99 | 0.98 | 0.99 | 0.99 | 19.22 |
| Varied tree size & parameters | 0.99 | 0.99 | 0.99 | 0.99 | 16.87 |
| Baseline | 0.99 | 0.98 | 0.99 | 0.99 | 2.89 |
| Varied tree size | 0.94 | 0.86 | 0.94 | 0.90 | 2.11 |
| Varied parameters | 0.99 | 0.98 | 0.99 | 0.99 | 2.86 |
| Varied tree size & parameters | 0.99 | 0.98 | 0.99 | 0.99 | 2.27 |
| Baseline | 0.98 | 0.94 | 0.96 | 0.96 | 1.17 |
| Varied tree size | 0.96 | 0.78 | 0.85 | 0.87 | 1.37 |
| Varied parameters | 0.83 | 0.88 | 0.84 | 0.86 | 1.27 |
| Varied tree size & parameters | 0.90 | 0.85 | 0.85 | 0.88 | 1.29 |
| Baseline | 0.99 | 0.98 | 0.99 | 0.99 | 1.71 |
| Varied tree size | 0.98 | 0.92 | 0.97 | 0.95 | 1.37 |
| Varied parameters | 0.99 | 0.98 | 0.99 | 0.99 | 1.27 |
| Varied tree size & parameters | 0.78 | 0.84 | 0.87 | 0.78 | 1.29 |
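A 10-fold cross-validation of the kind summarised above can be sketched with scikit-learn. The classifiers below (SVM with radial and polynomial kernels, KNN, decision tree) match names appearing elsewhere in this record, but the features are synthetic stand-ins for the eight tree statistics, so the resulting scores are illustrative only:

```python
import numpy as np
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the 8 tree statistics: 250 "non-structured" (label 0)
# and 250 "structured" (label 1) trees, with a mean shift between classes.
X = np.vstack([rng.normal(0.0, 1.0, (250, 8)),
               rng.normal(1.5, 1.0, (250, 8))])
y = np.array([0] * 250 + [1] * 250)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
models = {
    "SVM-radial": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "SVM-polynomial": make_pipeline(StandardScaler(), SVC(kernel="poly")),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Decision tree": DecisionTreeClassifier(random_state=0),
}
scores = {}
for name, model in models.items():
    res = cross_validate(model, X, y, cv=cv, scoring="accuracy")
    scores[name] = res["test_score"].mean()
    print(f"{name}: mean accuracy = {scores[name]:.2f}")
```

On real tree statistics one would also report sensitivity, specificity and AUC, as in the table, by passing the corresponding scorers to `cross_validate`.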
Fig. 4 ROC curves for two of the best classifiers, with their corresponding confusion matrices, for dataset 1. A ROC curve for SVM-radial, B ROC curve for SVM-polynomial, C confusion matrix for SVM-radial, D confusion matrix for SVM-polynomial
Execution times for tree dataset simulations and real data analysis
| Simulating tree datasets | Non-structured | Structured |
|---|---|---|
| Baseline | 8 hours, 13 minutes | 1 day, 3 hours, 39 minutes |
| Varied tree size | 8 hours, 36 minutes | 2 days, 8 hours, 40 minutes |
| Varied parameters | 7 hours, 1 minute | 1 day, 10 hours, 17 minutes |
| Varied tree size & parameters | 7 hours, 3 minutes | 1 day, 13 hours, 38 minutes |
| Real data | General population ( | Fishing communities and General population ( |
| Calculating tree statistics (250 bootstraps) | 4 hours, 31 minutes | 19 hours, 3 minutes |
| Calculating tree statistics (1000 bootstraps) | 18 hours, 23 minutes | 3 days, 1 hour, 42 minutes |
Fig. 5 A structured host population with two sub-populations and
Parameter values used for simulating phylogenetic trees for structured and non-structured populations, with their corresponding parameter values from the literature
| Structured population | Sub-population 1 | Sub-population 2 |
|---|---|---|
| Basic reproductive number | 4.99 | 9.09 |
| Birth rate | 3.0639 | 4.0178 |
| Death rate | 0.014 | 0.042 |
| Migration rate | 0.3 | 0.2 |
| Number of tips | 350 | 200 |
| Number of trees | 250 | 250 |
| Non-structured population | | |
| Basic reproductive number | 4.99 | |
| Birth rate | 0.0699 | |
| Death rate | 0.014 | |
| Number of tips | 350 | |
| Number of trees | 500 | |
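The paper's simulation procedure is a birth-death process with migration between sub-populations. As a minimal sketch, the following grows a bifurcating tree to a fixed number of tips under a pure-birth (Yule) assumption, ignoring the death and migration rates listed in the table; the nested parent-to-children representation is our own choice:

```python
import random

def simulate_yule_tree(n_tips, seed=1):
    """Grow a bifurcating tree by repeatedly splitting a random extant
    lineage until n_tips leaves exist (pure-birth sketch only)."""
    random.seed(seed)
    children = {0: []}          # node id -> list of child ids
    extant = [0]                # ids of current leaves
    next_id = 1
    while len(extant) < n_tips:
        # Pick a random extant lineage and replace it with two daughters.
        parent = extant.pop(random.randrange(len(extant)))
        kids = [next_id, next_id + 1]
        next_id += 2
        children[parent] = kids
        for k in kids:
            children[k] = []
        extant.extend(kids)
    return children, extant

# A tree the size of the non-structured simulations in the table (350 tips).
children, tips = simulate_yule_tree(350)
print(f"{len(tips)} tips, {len(children) - len(tips)} internal nodes")
```

A bifurcating tree with n tips always has n - 1 internal nodes, which the output reflects; a full reimplementation would add exponential waiting times, deaths and between-deme migration.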
Definitions for the tree statistics
| Tree statistics | Definition | References |
|---|---|---|
| Cherry | Pair of leaves that is adjacent to a common ancestor node. | [ |
| Normalized number of cherries | Number of cherries divided by half the number of tips in a tree. | [ |
| Sackin index | Sum, over all leaves in a tree, of the number of edges from each leaf to the root. | [ |
| Normalized Sackin index | Sackin index divided by | [ |
| Colless index | Sum of absolute differences between the numbers of left-hand and right-hand leaves (terminal tips) subtended at each internal node of a tree, the root inclusive. | [ |
| Normalized Colless index | Colless index divided by | [ |
| Total cophenetic index | Sum of all depths of the lowest common ancestor for all pairs of leaves in a tree. | [ |
| Ladder length | Ratio of maximum number of connected internal nodes with a single descendant leaf to number of leaves in a tree. | [ |
| Maximum depth | Maximum number of edges from a leaf to a root for all the leaves in a tree. | [ |
| Maximum width | Maximum, over all possible depths of a tree, of the number of nodes at that depth. | [ |
| Width-depth ratio | Ratio of maximum width of a tree to its maximum depth. | [ |
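The definitions above can be implemented directly on a parent-to-children representation of a tree. The 4-tip tree below is a hypothetical example (not the tree of Fig. 1), chosen so the statistics are easy to verify by hand:

```python
from collections import Counter

# A balanced 4-tip tree (hypothetical):  root -> (X, Y), X -> (1, 2), Y -> (3, 4)
tree = {"root": ("X", "Y"), "X": ("1", "2"), "Y": ("3", "4")}

def is_leaf(n):
    return n not in tree

def node_depths(root="root"):
    """Number of edges from the root to every node."""
    d, stack = {root: 0}, [root]
    while stack:
        n = stack.pop()
        for c in tree.get(n, ()):
            d[c] = d[n] + 1
            stack.append(c)
    return d

def n_leaves(n):
    """Number of leaves subtended by node n."""
    return 1 if is_leaf(n) else sum(n_leaves(c) for c in tree[n])

depths = node_depths()
leaves = [n for n in depths if is_leaf(n)]
internal = [n for n in depths if not is_leaf(n)]

# Cherry: internal node whose two children are both leaves.
cherries = sum(all(is_leaf(c) for c in tree[n]) for n in internal)
# Sackin index: sum of leaf depths.
sackin = sum(depths[l] for l in leaves)
# Colless index: sum over internal nodes of |left leaves - right leaves|.
colless = sum(abs(n_leaves(tree[n][0]) - n_leaves(tree[n][1])) for n in internal)
# Maximum depth and maximum width.
max_depth = max(depths[l] for l in leaves)
max_width = max(Counter(depths.values()).values())

print(cherries, sackin, colless, max_depth, max_width)
```

For this balanced tree the statistics are 2 cherries, Sackin index 8, Colless index 0, maximum depth 2 and maximum width 4; the normalised variants divide these by the tip-count factors given in the table.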
Tree statistics computed from the tree illustrated in Fig. 1
| Tree statistics | Computed value |
|---|---|
| Number of cherries | Cherries are formed by tips 1 & 2 and 4 & 5, so the number of cherries is 2. |
| Standardized number of Cherries | From the formula, |
| Sackin index | We consider each leaf and count the edges to the root; e.g. for leaf 1 there are 3 edges to the root. Summing over all leaves gives a Sackin index of 14. |
| Standardized Sackin index | From the formula, |
| Colless index | We consider each internal node; e.g. for internal node C, the difference between the numbers of left and right tips subtended is 1. Adding these values over all internal nodes gives a Colless index of 2. |
| Standardized Colless index | From the formula, |
| Total cophenetic index | |
| Ladder length | One internal node C has a single child descendant leaf, ladder length therefore is |
| Maximum depth | Depths for tips 1, 2, 3, 4 & 5 are 3, 3, 2, 3 & 3 respectively. Since 3 is the largest, the maximum depth is 3. |
| Maximum width | Depths for tips 1, 2, 3, 4 & 5 are 3, 3, 2, 3 & 3 respectively, and depths for internal nodes A, B, C, D & E are 0, 1, 1, 2 & 2. Depth 3 contains the most nodes (4), so the maximum width is 4. |
| Width-depth ratio | Maximum width (4) divided by maximum depth (3) gives 4/3 ≈ 1.33. |
Descriptions for machine learning techniques
| Machine learning technique | Description | References |
|---|---|---|
| K-nearest neighbour (KNN) | KNN classifies an object based on the closest training examples in the feature space. KNN is a supervised machine learning technique in which the data are divided into two sets: a training set and a test set. The training set is used to train the machine (learning), while the test set is used to determine the classes of the given objects (actual classification). Given an unknown sample | [ |
| Support vector machine (SVM) | SVM is a supervised learning method for binary classification. It finds the best separating hyperplane between the two classes of training samples in the feature space. Suppose we have | [ |
| Decision tree (DT) | The DT procedure divides a data set into subdivisions based on a set of tests defined at each branch or node. From the given data, a tree is constructed, composed of a root, internal nodes known as splits, and a set of leaves (the terminal nodes). Data are classified according to the decision framework defined by the tree: class labels are assigned at the leaf nodes, each observation receiving the label of the leaf into which it falls. The learning algorithm defines the splits at each internal node from the training data set. For an accurate decision tree, the training data should be of high quality so that the relations between features and classes can be learned reliably. | [ |
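As a sketch of the KNN description above, a minimal nearest-neighbour classifier (Euclidean distance, majority vote) might look as follows; the feature values and class labels are invented purely for illustration:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training samples,
    using Euclidean distance in the feature space."""
    dists = sorted(
        (math.dist(x, p), label) for p, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy 1-D feature space: a single "tree statistic" with two class clusters
# (hypothetical numbers).
train_X = [(0.1,), (0.2,), (0.3,), (0.8,), (0.9,), (1.0,)]
train_y = ["non-structured"] * 3 + ["structured"] * 3

print(knn_predict(train_X, train_y, (0.15,)))  # near the low cluster
print(knn_predict(train_X, train_y, (0.95,)))  # near the high cluster
```

A query near the low cluster is voted "non-structured" and one near the high cluster "structured"; a production version would use all eight tree statistics as features and tune k by cross-validation.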