Irene Vrbik, David A Stephens, Michel Roger, Bluma G Brenner.
Abstract
BACKGROUND: In the context of infectious disease, sequence clustering can be used to provide important insights into the dynamics of transmission. Cluster analysis is usually performed using a phylogenetic approach whereby clusters are assigned on the basis of sufficiently small genetic distances and high bootstrap support (or posterior probabilities). The computational burden involved in this phylogenetic threshold approach is a major drawback, especially when a large number of sequences are being considered. In addition, this method requires a skilled user to specify the appropriate threshold values, which may vary widely depending on the application.
Year: 2015 PMID: 26538192 PMCID: PMC4634160 DOI: 10.1186/s12859-015-0791-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Plot of the sorted pairwise distances and nearest neighbours with respect to a sequence X. The plot shows the sorted pairwise distances with respect to the first sequence from a random run taken from Simulation 1. The vertical grey line identifies k∗ (the position at which the largest gap is observed); the vertical red line represents c (the largest gap between sorted pairwise distances); the horizontal blue line marks the largest pairwise distance observed before the gap. The nearest neighbours are denoted by ‘N’s
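The single-sequence step described in this caption can be sketched as follows. This is a minimal illustration, assuming that the nearest neighbours are exactly the sequences whose distances fall before the largest gap; the function name `largest_gap_neighbours` and the toy distances are hypothetical, and the sketch does not cover how per-sequence neighbour sets are combined into final clusters.

```python
def largest_gap_neighbours(dist_row, self_index):
    """Split one sequence's sorted distances at the largest gap.

    dist_row   : distances from the reference sequence to every sequence
    self_index : position of the reference sequence itself in dist_row
    Returns (neighbour indices, size of the largest gap,
             largest distance observed before the gap).
    """
    others = [j for j in range(len(dist_row)) if j != self_index]
    if len(others) < 2:
        raise ValueError("need at least two other sequences to define a gap")

    # Sort the other sequences by their distance to the reference sequence.
    order = sorted(others, key=lambda j: dist_row[j])
    sorted_d = [dist_row[j] for j in order]

    # Gaps between consecutive sorted distances.
    gaps = [b - a for a, b in zip(sorted_d, sorted_d[1:])]

    # k_star counts the sequences that lie before the largest gap
    # (the first such gap is used if there are ties).
    k_star = max(range(len(gaps)), key=gaps.__getitem__) + 1
    return order[:k_star], gaps[k_star - 1], sorted_d[k_star - 1]


# Toy example (made-up distances): sequence 0 against five others.
row = [0.0, 0.02, 0.03, 0.25, 0.27, 0.30]
neighbours, gap, last_before_gap = largest_gap_neighbours(row, 0)
# neighbours -> [1, 2]; the largest gap separates 0.03 from 0.25
```

Because the split point is taken from the data rather than from a user-supplied cut-off, no distance threshold has to be chosen in this step.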
Clustering results for the Gap Procedure on simulated data
| Sim | | | Average time (in sec) | Average # clusters | Average # singletons | Average ARI |
|---|---|---|---|---|---|---|
| 1 | 100 | 4 | 0.1108 | 4.25 | 0.04 | 0.9854 |
| 2 | 150 | 6 | 0.1370 | 6.39 | 0.04 | 0.9856 |
| 3 | 500 | 20 | 0.6073 | 22.49 | 0.13 | 0.9750 |
| 4 | 1250 | 50 | 6.6194 | 58.11 | 0.43 | 0.9694 |
The average clustering results (taken over 100 runs) obtained by the Gap Procedure when applied to the simulated data. The dissimilarity matrix was calculated using the aK80 distance formula and sequences (of length 800) were mutated according to a GTR + I + Γ model
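For orientation, a pairwise distance of the K80 family can be computed as sketched below. This is the standard Kimura two-parameter (K80) formula, not the aK80 variant named in the caption, whose definition is not given in this section; skipping gaps and ambiguity codes is an assumption of this sketch, and `k80_distance` is a hypothetical helper name.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k80_distance(seq_a, seq_b):
    """Standard K80 distance between two aligned nucleotide sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    compared = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Assumption: sites with gaps or ambiguity codes are ignored.
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        # A transition stays within purines or within pyrimidines.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1

    if compared == 0:
        raise ValueError("no comparable sites")

    p = transitions / compared      # proportion of transitions
    q = transversions / compared    # proportion of transversions
    term1 = 1.0 - 2.0 * p - q
    term2 = 1.0 - 2.0 * q
    if term1 <= 0.0 or term2 <= 0.0:
        return math.inf             # sequences too divergent for the formula
    return -0.5 * math.log(term1 * math.sqrt(term2))


# Made-up example: one transition in ten sites.
d = k80_distance("AAAAAAAAAA", "GAAAAAAAAA")
```

Applying such a function to every pair of sequences yields the dissimilarity matrix that a distance-based clustering method takes as input.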
Fig. 2 Clustering results for RAxML with a clade support threshold of 90 and a distance threshold of 0.6. The maximum likelihood phylogenetic tree (n=94) produced by RAxML for Simulation 1. High (≥90), medium (50–90) and low (<50) bootstrap values are denoted by yellow, grey and white rectangles, respectively. Cluster indices are represented by coloured tip labels; singletons are denoted in black
Fig. 3 Clustering results for MrBayes with a clade support threshold of 90 and a distance threshold of 0.6. The maximum likelihood phylogenetic tree (n=100) produced by RAxML for Simulation 1. High (≥90), medium (50–90) and low (<50) posterior probabilities are denoted by yellow, grey and white rectangles, respectively. Cluster indices are represented by coloured tip labels; singletons are denoted in black
Clustering results for RAxML on simulated data
| Method | Sim | Clade support threshold | Distance threshold | Time (in sec) | # clusters | # singletons | ARI |
|---|---|---|---|---|---|---|---|
| RAxML | 1 | 90 | 0.3 | 2479.0 | 13 | 21 | 0.3662 |
| | 2 | 90 | 0.3 | 4654.0 | 13 | 10 | 0.7054 |
| | 3 | 90 | 0.3 | 41584.6 | 61 | 33 | 0.6206 |
| | 4 | 90 | 0.3 | 271593.7 | 167 | 70 | 0.4889 |
| | 1 | 90 | 0.6 | 2479.0 | 7 | 4 | 0.8757 |
| | 2 | 90 | 0.6 | 4654.0 | 9 | 5 | 0.8945 |
| | 3 | 90 | 0.6 | 41584.6 | 24 | 6 | |
| | 4 | 90 | 0.6 | 271593.7 | 54 | 2 | |
The clustering results (for a single run) obtained by RAxML when applied to the simulated data. The quoted run times represent the time it takes RAxML to produce a phylogenetic tree and obtain clade support values (conducted using 100 bootstrap replicates). RAxML clusters are obtained using a clade support threshold of 90 and distance thresholds of 0.3 and 0.6. The ARI scores in bold indicate which runs performed better than the average score obtained using the Gap Procedure
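The ARI (adjusted Rand index) used to score these partitions can be computed from the contingency counts of the true and estimated labels. The sketch below uses the usual Hubert and Arabie form of the index; the function name and the toy labels are illustrative and are not taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same items."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("label vectors must have the same length")
    if n < 2:
        raise ValueError("need at least two items")

    # Pairs of items placed together in both partitions, in the true
    # partition only, and in the predicted partition only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = together_true * together_pred / comb(n, 2)
    maximum = 0.5 * (together_true + together_pred)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are all singletons
    return (together_both - expected) / (maximum - expected)


# Toy labels: the same grouping under different names scores 1.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A grouping that cuts across the true clusters scores below 0.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

A score of 1 indicates identical partitions regardless of how clusters are labelled, which is why the index is suitable for comparing methods that number their clusters differently.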
Clustering results for MrBayes on simulated data
| Method | Sim | Clade support threshold | Distance threshold | Time (in sec) | # clusters | # singletons | ARI |
|---|---|---|---|---|---|---|---|
| MrBayes | 1 | 90 | 0.3 | 3324.7 | 13 | 3 | 0.4642 |
| | 2 | 90 | 0.3 | 4243.6 | 19 | 7 | 0.5129 |
| | 3 | 90 | 0.3 | 144284.8 | 54 | 11 | 0.6565 |
| | 4 | 90 | 0.3 | 1328253.9 | 134 | 25 | 0.6269 |
| | 1 | 90 | 0.6 | 3324.7 | 8 | 2 | 0.8419 |
| | 2 | 90 | 0.6 | 4243.6 | 10 | 3 | 0.9011 |
| | 3 | 90 | 0.6 | 144284.8 | 24 | 6 | |
| | 4 | 90 | 0.6 | 1328253.9 | 52 | 3 | |
The clustering results (for a single run) obtained by MrBayes when applied to the simulated data. The quoted run times represent the time it takes MrBayes to estimate a phylogenetic tree with clade support (i.e., posterior probability) values. MrBayes clusters are obtained using a clade support threshold of 90 and distance thresholds of 0.3 and 0.6. The ARI scores in bold indicate which runs performed better than the average score obtained using the Gap Procedure
A summary of the subset data taken from the HIV-1 sequence data
| Name | Description | # sequences | # clusters (≥2 members) | Cluster size 1 | Cluster size 2–4 | Cluster size ≥5 |
|---|---|---|---|---|---|---|
| all | Entire set | 1517 | 169 | 533 | 108 (311) | 61 (673) |
| men | Only males | 1391 | 152 | 488 | 96 (276) | 56 (627) |
| non.sing | Clustered sequences | 984 | 169 | 0 | 108 (311) | 61 (673) |
| nsm | Clustered males | 903 | 152 | 0 | 96 (276) | 56 (627) |
| big | Sequences clustered to big | 673 | 61 | 0 | 0 (0) | 61 (673) |
| mibc | Males clustered to big | 627 | 56 | 0 | 0 (0) | 56 (627) |
The total numbers of small and large clusters are listed under the headings (2–4) and (≥5). The corresponding number of sequences belonging to each heading is given in parentheses
a ‘big’ clusters are defined to have ≥5 members
b the number of clusters having ≥2 members
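The counts in this table can be cross-checked arithmetically. The sketch below reads the third column as the total number of sequences and the fourth as the number of clusters with at least two members; that reading is an interpretation rather than something the table header states, and the variable names are illustrative.

```python
# name: (total, clusters, singletons,
#        small_clusters, small_sequences, big_clusters, big_sequences)
subsets = {
    "all":      (1517, 169, 533, 108, 311, 61, 673),
    "men":      (1391, 152, 488,  96, 276, 56, 627),
    "non.sing": ( 984, 169,   0, 108, 311, 61, 673),
    "nsm":      ( 903, 152,   0,  96, 276, 56, 627),
    "big":      ( 673,  61,   0,   0,   0, 61, 673),
    "mibc":     ( 627,  56,   0,   0,   0, 56, 627),
}

consistent = {}
for name, (total, clusters, singles, small_c, small_s, big_c, big_s) in subsets.items():
    # Singletons plus clustered sequences should give the subset size,
    # and small plus big clusters should give the cluster count.
    consistent[name] = (singles + small_s + big_s == total
                        and small_c + big_c == clusters)

all_consistent = all(consistent.values())
```

Under this reading every row adds up, which supports interpreting the parenthesised values as sequence counts and the leading values as cluster counts.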
Clustering results for the Gap Procedure on HIV-1 data
| Subset | Time (in sec) | 1 ✓ | 1 ✗ | 2–4 | ≥5 | ARI |
|---|---|---|---|---|---|---|
| all | 10.56 | 237 | 16 | 244 (619) | 61 (645) | 0.9170 |
| men | 8.261 | 225 | 18 | 215 (536) | 60 (612) | 0.9097 |
| non.sing | 3.086 | – | 12 | 125 (351) | 57 (621) | 0.9325 |
| nsm | 2.470 | – | 11 | 109 (303) | 54 (589) | 0.9320 |
| big | 0.807 | – | 3 | 5 (14) | 61 (656) | 0.9523 |
| mibc | 0.634 | – | 3 | 5 (14) | 56 (610) | 0.9492 |
The ARI scores and running times of the Gap Procedure when performed on subsets of the HIV-1 data. The numbers of correctly (and incorrectly) identified singletons are listed under “1 ✓” (and “1 ✗”). The total numbers of small and large clusters are listed under the headings (2–4) and (≥5). The corresponding number of sequences belonging to each class is given in parentheses