Caroline Colijn, Jennifer Gardy.
Abstract
BACKGROUND AND OBJECTIVES: Whole-genome sequencing is becoming popular as a tool for understanding outbreaks of communicable diseases, with phylogenetic trees being used to identify individual transmission events or to characterize overall outbreak-level transmission dynamics. Existing methods to infer transmission dynamics from sequence data rely on well-characterized infectious periods and on epidemiological and clinical metadata, which may not always be available, and they typically require computationally intensive analysis focusing on the branch lengths in phylogenetic trees. We sought to determine whether the topological structures of phylogenetic trees contain signatures of the transmission patterns underlying an outbreak.
Keywords: computational modelling; evolutionary epidemiology; genomic epidemiology; machine learning
Year: 2014 PMID: 24916411 PMCID: PMC4097963 DOI: 10.1093/emph/eou018
Source DB: PubMed Journal: Evol Med Public Health ISSN: 2050-6201
Figure 1. Schematic illustration of different kinds of transmission networks. The index case is marked in grey.
Figure 2. Distribution of simple summary measures of tree topology.
Figure 3. Box plots of the features used to summarize the shapes of phylogenies.
Results of cross-validated classification
| KNN | Hom | SS | Chains |
|---|---|---|---|
| Baseline | 0.89 (0.03) | 0.85 (0.05) | 1.00 (0) |
| Varied parameters | 0.90 (0.01) | 0.89 (0.01) | 1.00 (0) |
| Varied sampling | 0.98 (0.01) | 0.21 (0.01) | 1.00 (0) |
| Varied both | 0.98 (0.01) | 0.20 (0.01) | 1.00 (0) |
| 10 isolates | 0 (0.001) | 0.5 (0.5) | 0.5 (0.53) |
| 20 isolates | 0.99 (0.01) | 0.01 (0.001) | 0 (0) |
Sensitivity (the true positive rate) here is the proportion of homogeneous outbreaks correctly classified as homogeneous, and specificity (the true negative rate) is the proportion of super-spreader outbreaks correctly classified as super-spreader. For SVM classification, sensitivity and specificity trade off against each other: greater sensitivity can be achieved at the cost of reduced specificity, and vice versa. Sensitivity and specificity are computed at the optimal threshold returned by MATLAB’s perfcurve function. The AUC (area under the ROC curve) captures the overall classifier quality. For KNN classification we report the proportion correct by outbreak type, as there are three types. Numbers shown are mean (standard deviation) over the 10 classifiers found with 10-fold cross-validation of the baseline case.
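The quantities in this note can be sketched in a few lines of Python. This is a minimal illustration, not the authors' MATLAB code: the threshold is chosen here by maximizing sensitivity + specificity (Youden's J), which is an assumed stand-in for the operating point that perfcurve returns, and the sample standard deviation is likewise an assumption. The labels, scores and per-fold values are made-up toy numbers.

```python
from statistics import mean, stdev

def sens_spec(labels, scores, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive.

    labels: 1 for the class treated as positive, 0 for the other class.
    """
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def best_threshold(labels, scores):
    """Assumed criterion: the observed score that maximizes sens + spec."""
    return max(sorted(set(scores)), key=lambda t: sum(sens_spec(labels, scores, t)))

def summarize(values):
    """Format per-fold results as 'mean (standard deviation)'."""
    return f"{mean(values):.2f} ({stdev(values):.2f})"

# Toy classifier scores for seven outbreaks (hypothetical values).
labels = [1, 1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.3, 0.1]

threshold = best_threshold(labels, scores)
sensitivity, specificity = sens_spec(labels, scores, threshold)

# Toy per-fold sensitivities, reported in the table's "mean (sd)" style.
fold_summary = summarize([0.88, 0.90, 0.92])
```

Raising the threshold above the chosen value lowers sensitivity and raises specificity, which is the trade-off the note describes.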
Figure 4. (a) Sensitivity of the SVM classification increases as the variability in the number of secondary cases in the outbreak increases. Variability is quantified as the ratio of the standard deviation to the mean of the numbers of secondary cases caused by an infectious case. Sensitivity is the proportion of simulated outbreaks with the corresponding variability that were classed as super-spreader outbreaks; the solid line shows the mean sensitivity over the 10 SVMs produced by cross-validation and dotted lines are the mean ± the standard deviation. (b and c) ROCs for the SVM classifier based on the 11 summary metrics describing tree shape. ROC curves are a visual way to assess the classifier’s quality: a perfect classifier will obtain all the true positives and will have no false positives, giving an AUC of 1. An imperfect classifier has a trade-off, and can attain a sensitivity (true positive rate) of 1 only at the cost of a higher false-positive rate, up to a false-positive rate of 1 (top right corner of the plot). The ROC curve illustrates the shape of this trade-off; the higher the AUC, the higher the quality of the classifier. Guessing yields an AUC of 0.5. In b, different lines correspond to the different groups of simulations in the SVM sensitivity analysis. Panel c shows the SVM classifier’s performance when only the earliest outbreak isolates are sampled. Performance is poor with 10 isolates (black line) and better with 20 (blue line).
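The ROC and AUC behaviour described in this caption can be reproduced with a short sketch. It is a plain-Python illustration under simple assumptions (one threshold per distinct score, trapezoidal integration), not the procedure used to draw the figure, and the function names and toy scores are hypothetical.

```python
def roc_points(labels, scores):
    """(false-positive rate, true-positive rate) pairs, from strictest to loosest threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing called positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# A classifier that ranks every positive above every negative.
perfect = roc_points([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])

# A classifier that gives every outbreak the same score (pure guessing).
guessing = roc_points([1, 1, 0, 0], [0.5, 0.5, 0.5, 0.5])

auc_perfect = auc(perfect)
auc_guessing = auc(guessing)
```

In both cases the last point of the curve is the top right corner, where the true positive rate and the false-positive rate are both 1; the perfect ranking reaches a true positive rate of 1 while the false-positive rate is still 0.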