Manal Helal, Fanrong Kong, Sharon Ca Chen, Fei Zhou, Dominic E Dwyer, John Potter, Vitali Sintchenko.
Abstract
BACKGROUND: Comparative genomics has put additional demands on the assessment of similarity between sequences and on their clustering as a means of classification. However, defining the optimal number of clusters, cluster density and cluster boundaries for sets of potentially related gene sequences with variable degrees of polymorphism remains a significant challenge. The aim of this study was to develop a method that would identify the cluster centroids and the optimal number of clusters for a given sensitivity level, and that would work equally well across different sequence datasets.
Year: 2012 PMID: 22587938 PMCID: PMC3351711 DOI: 10.1186/2042-5783-2-2
Source DB: PubMed Journal: Microb Inform Exp ISSN: 2042-5783
Figure 1. Heatmaps for the distance matrix generated by the MSA of the different datasets (a) .
Figure 2. The optimal number of clusters for the different hash ranges and different numbers of indices per cluster for (a) .
Figure 3. Largest four . Positions of cluster centroids are highlighted in square blocks, on PCA 1 and 2 as x-axis and y-axis coordinates, and linear mapping clustering results.
Figure 4. The linear mapping clustering for the 109 EV71 VP1 sequences shown on the first and second PCA coordinates. Eleven clusters corresponding to the genogroups/subgenogroups are presented. The legend indicates EV71 VP1 genogroups/subgenogroups, years of isolation of the sequenced viruses or of their deposition in GenBank, and country of origin. Relevant cluster centroids are highlighted in red. Isolate AF 119795 (&), belonging to C/B genogroups, is a result of intergenotypic recombination [41]. Abbreviations: AUS, Australia; CHN, the Chinese mainland; JAP, Japan; MAL, Malaysia; NOR, Norway; SK, South Korea; SA, South Africa; SIN, Singapore; TW, Taiwan; UK, United Kingdom; USA, United States of America.
Figure 5. Three main steps of the MSA method: extraction of the second diagonal, hashing of the distance measures, and clustering of the hash codes.
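The three steps named in the Figure 5 caption can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: the "second diagonal" is taken to be the diagonal immediately above the main diagonal of the distance matrix, the hash is a plain min-max linear mapping onto `hash_range` integer codes, and "clustering" is reduced to grouping positions that share a code. The function names `second_diagonal`, `linear_hash` and `cluster_by_hash` and the parameter `hash_range` are hypothetical.

```python
from collections import defaultdict


def second_diagonal(dist):
    """Return d(i, i+1) for each adjacent pair of sequences.

    Assumption: 'second diagonal' means the diagonal just above the
    main diagonal of a square distance matrix.
    """
    return [dist[i][i + 1] for i in range(len(dist) - 1)]


def linear_hash(values, hash_range):
    """Linearly map distances onto integer codes 0 .. hash_range - 1.

    Assumption: simple min-max normalisation stands in for the paper's
    hash function.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All distances are equal, so everything falls in one bucket.
        return [0] * len(values)
    codes = []
    for v in values:
        code = int((v - lo) / (hi - lo) * hash_range)
        # The maximum value would land on hash_range; clamp it to the top code.
        codes.append(min(code, hash_range - 1))
    return codes


def cluster_by_hash(codes):
    """Group positions that share a hash code (a stand-in for the clustering step)."""
    groups = defaultdict(list)
    for idx, code in enumerate(codes):
        groups[code].append(idx)
    return dict(groups)


# Toy 4 x 4 distance matrix (made-up values, not data from the study).
toy = [
    [0.0, 0.1, 0.5, 0.9],
    [0.1, 0.0, 0.2, 0.8],
    [0.5, 0.2, 0.0, 1.0],
    [0.9, 0.8, 1.0, 0.0],
]
diag = second_diagonal(toy)
codes = linear_hash(diag, hash_range=3)
clusters = cluster_by_hash(codes)
```

In this toy case the two small adjacent distances share one code and the large distance gets another, which hints at how a wider or narrower hash range could change the number of groups obtained.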