Manal Helal, Fanrong Kong, Sharon C A Chen, Michael Bain, Richard Christen, Vitali Sintchenko.
Abstract
BACKGROUND: The intra- and inter-species genetic diversity of bacteria and the absence of 'reference', or most representative, sequences of individual species present a significant challenge for sequence-based identification. The aims of this study were to determine the utility and compare the performance of several clustering and classification algorithms in identifying the species of 364 16S rRNA gene sequences with a defined species in GenBank and 110 16S rRNA gene sequences with no defined species, all within the genus Nocardia.
Year: 2011 PMID: 21687706 PMCID: PMC3110597 DOI: 10.1371/journal.pone.0019517
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Clusters of different Nocardia species determined by MSA comparisons.
Nocardia species clusters and their reference sequences*.
| Cluster name | Number of sequences in the cluster | Reference sequence name | GenBank accession number | Proportion of other |
| | 40 | | AJ508414 | 5.00 |
| | 21 | | X80595 | 9.52 |
| | 18 | | AB212947 | 11.11 |
| | 18 | | X80593 | 0.0 |
| | 17 | | DQ659901 | 0.0 |
| | 17 | | X80592 | 0.0 |
| | 12 | | DQ659897 | 0.0 |
| | | | Z82229 | |
| | 9 | | AF278572 | 22.22 |
| | 8 | | DQ659905 | 50.0 |
| | 8 | | X80599 | 0.0 |
| | 8 | | DQ659914 | 0.0 |
| | 7 | | X80591 | 14.29 |
| | 7 | | X80598 | 0.00 |
| | 6 | | AF430035 | 0.00 |
| | 6 | | | |
| | 6 | | Z82228 | 16.67 |
| | 5 | | Z82218 | 40.0 |
| | 5 | | DQ659896 | 40.0 |
| | 5 | | AF430045 | 0.0 |
| | 5 | | AB126876 | 0.0 |
| | 5 | | AB092563 | 0.00 |
| | 5 | | AB201303 | 60.0 |
| | 4 | | AF430040 | 0.0 |
| | 4 | | AF430041 | 0.0 |
| | 4 | | AB097455 | 0.0 |
| | 4 | | AB162802 | 25.0 |
Note: Rows of Nocardia species assigned to more than 1 cluster are highlighted in grey.
*Only clusters with more than three 16S rRNA gene sequences are shown. For the full list refer to Table S3.
**One N. otitidiscaviarum 16S rRNA gene sequence in a cluster of five, suggestive of misclassification in the original submission.
***One N. asteroides 16S rRNA gene sequence in a cluster of six; the other five sequences belong to N. beijingensis, suggestive of misclassification in the original submission.
Species names of 16S rRNA gene sequences that were co-clustered with other species.
| | Co-clustered with |
Figure 2. Relative variability of different regions of the 16S rRNA gene sequence of Nocardia species.
Distance matrix generated with a sliding window of 200 nt. Peaks indicate highly variable regions of the gene.
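As an illustration of the sliding-window idea in this caption, the following minimal sketch computes a per-window variability profile from aligned sequences. It is not the authors' procedure: the use of the mean pairwise p-distance (proportion of mismatched, non-gap positions) as the variability measure, and the names `p_distance` and `window_variability`, are assumptions made for this example.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing positions, skipping columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def window_variability(aligned, window=200, step=1):
    """Return (window start, mean pairwise p-distance) for each window.

    `aligned` is a list of equal-length strings taken from a multiple
    sequence alignment. The distance measure is an assumption of this
    sketch; the source only states that a 200 nt window was used.
    """
    if len(aligned) < 2:
        raise ValueError("need at least two aligned sequences")
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to equal length")
    profile = []
    for start in range(0, length - window + 1, step):
        chunk = [s[start:start + window] for s in aligned]
        dists = [p_distance(a, b) for a, b in combinations(chunk, 2)]
        profile.append((start, sum(dists) / len(dists)))
    return profile


# Toy alignment: only the last two columns vary.
toy = ["AAAAAAAA", "AAAAAATT", "AAAAAAAA"]
toy_profile = window_variability(toy, window=4, step=2)
```

Plotting the second element of each tuple against the window start would give a profile in which peaks mark the more variable stretches of the alignment.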
Figure 3. Principal component analysis (PCA) clustering using the Linear Mapping algorithm over the Euclidean distance of the PCA 1 (x-axis) and PCA 2 (y-axis) scores (A) and of the PCA 2 (x-axis) and PCA 3 (y-axis) scores (B).
Figure 4. ‘Mountain’ view of Nocardia species classification, in which clusters are represented as peaks on the 3D terrain, with the cluster number (starting from 0) pointing to the corresponding mountain peak.
The shape of each peak is a Gaussian curve, which is a rough estimate of the distribution of the data within each cluster. The volume of a peak is proportional to the number of strains contained within the cluster. The height of each peak is proportional to the cluster's internal similarity. The colour of a peak reflects the cluster's internal deviation: red indicates low deviation, whereas blue indicates high deviation. Only the colour at the tip of a peak is significant; the colour of all other areas is determined by blending to create a smooth transition. The numbers indicate the number of a cluster in the experiment (see Supplemental material for details).
Comparative performance of clustering algorithms.
| Clustering algorithm | Exact match (%) | Partial match (%) |
| Linear Mapping | 304 (83.52) | 339 (93.13) |
| Cluto | 304 (83.52) | 332 (91.21) |
| Hierarchical Clustering | 294 (80.77) | 320 (87.91) |
| k-means | 258 ± 0.091 (70.88) | 309 (84.89) |
| PCA | 291 (79.95) | 326 (89.56) |
*Confidence Intervals can be calculated only for k-means.
**Partial match was defined as assignment of a sequence to a cluster which contains sequences of Nocardia species that match this sequence along with other species.
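The percentages in this table can be reproduced from the counts, which gives a quick consistency check. The sketch below assumes the denominator is the 364 sequences with a defined species mentioned in the abstract; the label `k-means` for the row carrying the ± interval is inferred from the first footnote.

```python
# Assumption: all percentages are taken over the 364 sequences with a defined species.
TOTAL_SEQUENCES = 364

# (exact matches, partial matches) as listed in the table.
match_counts = {
    "Linear Mapping": (304, 339),
    "Cluto": (304, 332),
    "Hierarchical Clustering": (294, 320),
    "k-means": (258, 309),  # row name inferred from the footnote on confidence intervals
    "PCA": (291, 326),
}


def percent(count, total=TOTAL_SEQUENCES):
    """Share of `total` expressed as a percentage, rounded to two decimals."""
    return round(100 * count / total, 2)


match_percentages = {
    name: (percent(exact), percent(partial))
    for name, (exact, partial) in match_counts.items()
}

for name, (exact_pct, partial_pct) in match_percentages.items():
    print(f"{name}: exact {exact_pct:.2f}%, partial {partial_pct:.2f}%")
```

Under that assumption every recomputed value agrees with the tabulated percentage to two decimal places.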
Performance of classification algorithms against the consensus.
| Method | Agreement with consensus (%) | Expected frequency of predicted species |
| Simple | 92.73 | 0.669 |
| Matlab NBayes | 88.18 | 0.647 |
| Matlab | 87.27 | 0.627 |
| Alignment | 77.27 | 0.592 |
| WEKA NBayes | 57.27 | 0.475 |
| WEKA | 52.72 | 0.458 |
Note: The first column presents the accuracy of each method compared to the consensus of all methods used (i.e., the percentage of predictions made by each method that agree with the majority prediction by all methods). The second column shows the expected frequency of the classification made by each method (i.e., the mean frequency of the species predicted by each method, taken over all 110 sequences).
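The two measures described in this note can be sketched as follows. This is one possible reading, not the authors' code: the consensus is taken as the most common prediction across methods for each sequence, and the "frequency" of a predicted species is assumed to be the fraction of methods that predicted that species for the same sequence. The function name `consensus_metrics` and the toy labels are hypothetical.

```python
from collections import Counter


def consensus_metrics(predictions):
    """Compute agreement with consensus (%) and expected frequency per method.

    `predictions` maps a method name to its list of predicted species,
    one entry per sequence, in the same sequence order for every method.
    Ties for the majority label are broken arbitrarily in this sketch.
    """
    methods = list(predictions)
    n_seq = len(predictions[methods[0]])
    if any(len(predictions[m]) != n_seq for m in methods):
        raise ValueError("every method must predict every sequence")

    agree = {m: 0 for m in methods}
    freq_sum = {m: 0.0 for m in methods}
    for i in range(n_seq):
        votes = Counter(predictions[m][i] for m in methods)
        majority = votes.most_common(1)[0][0]
        for m in methods:
            label = predictions[m][i]
            agree[m] += label == majority
            # Assumed meaning of "frequency": share of methods giving this label.
            freq_sum[m] += votes[label] / len(methods)

    return {
        m: (100 * agree[m] / n_seq, freq_sum[m] / n_seq)
        for m in methods
    }


# Toy example with three hypothetical methods and three sequences.
toy_predictions = {
    "A": ["x", "y", "z"],
    "B": ["x", "y", "y"],
    "C": ["x", "z", "y"],
}
toy_metrics = consensus_metrics(toy_predictions)
```

In the toy example, method B always sides with the majority, so it reaches 100% agreement and the highest expected frequency.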