Dan Wei, Qingshan Jiang, Yanjie Wei, Shengrui Wang.
Abstract
BACKGROUND: Clustering DNA sequences into functional groups is an important problem in bioinformatics. We propose a new alignment-free algorithm, mBKM, based on a new distance measure, DMk, for clustering gene sequences. The method transforms DNA sequences into feature vectors that capture the occurrence, location, and order relation of k-tuples in each sequence. A hierarchical procedure is then applied to cluster the DNA sequences based on these feature vectors.
Year: 2012 PMID: 22823405 PMCID: PMC3443659 DOI: 10.1186/1471-2105-13-174
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
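To make the idea of k-tuple feature vectors in the abstract concrete, the following is a minimal Python sketch that records, for each k-tuple in a sequence, how often it occurs and where. The function name `ktuple_features` and the choice of mean position as the location summary are assumptions made for illustration; this is not the paper's DMk definition, which is not given in this record.

```python
from collections import defaultdict


def ktuple_features(seq, k=3):
    """Return {k-tuple: (count, mean_position)} for one DNA sequence.

    Illustrative only: the mean 0-based start position stands in for the
    'location' information mentioned in the abstract.
    """
    positions = defaultdict(list)
    # Slide a window of width k over the sequence and record each start index.
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return {t: (len(p), sum(p) / len(p)) for t, p in positions.items()}


# Toy sequence, not taken from the data sets.
features = ktuple_features("ACGTACGA", k=2)
for tup, (count, mean_pos) in sorted(features.items()):
    print(tup, count, mean_pos)
```

Two sequences can then be compared by comparing their feature dictionaries over the union of observed k-tuples. How the paper combines occurrence, location, and order into a single distance is not specified here.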
Description for the Data Sets
| DS1 | beta-globin | 176 | 1531 | Cytochrome P450 |
| | beta-Hemoglobin | 89 | 448 | Hemoglobin subunit |
| | integrin_alpha | 142 | 3360 | Integrin, alpha |
| | ketoacyl-synt1 | 43 | 754 | Estradiol 17-beta-dehydrogenase 8 |
| | myoglobin | 55 | 478 | Cytoglobin Myoglobin |
| | RWD | 93 | 825 | RWD domain-containing protein |
| | VCL | 92 | 2746 | Vinculin |
| | Histone | 81 | 668 | Histone |
| DS2 | HBG106679 | 22 | 446 | Copper uptake protein 2 |
| | HBG108349 | 49 | 718 | Prolactin |
| | HBG079775 | 26 | 3152 | Transcription elongation factor SPT5 |
| | HBG058842 | 34 | 1351 | TNFR superfamily member 1A |
| | HBG002834 | 92 | 951 | Calumenin/Reticulocalbin |
| | HBG050441 | 58 | 1899 | ATP-binding cassette sub-family G member |
| DS3 | HBG093787 | 32 | 1769 | Hypothetical membrane proteins |
| | HBG099893 | 34 | 430 | Putative membrane protein precursor |
| | HBG415481 | 65 | 557 | Phasin like/family protein |
| | HBG423057 | 32 | 236 | Hypothetical proteins |
| | HBG050644 | 99 | 3129 | Beta galactosidase, beta glucuronidase, Evolved beta-D-galactosidase alpha subunit |
| | HBG364776 | 48 | 1069 | Formate dehydrogenase gamma subunit precursor |
| DS4 | HBG000080 | 29 | 674 | BWK-1,CG6617-PA , Zgc:73100 C20orf11 homolog , RH01588p |
| | HBG060165 | 28 | 163 | ATP synthase, H + transporting mitochondrial F1 complex/epsilon subunit |
| | HBG010471 | 48 | 1802 | Hypothetical Glycosyl transferase, family 25/Endoplasmic reticulum targeting sequence containing protein |
| | HBG000013 | 70 | 318 | 60 S ribosomal protein L36a-like, 60 S ribosomal protein L42, L44, IP15820p, RPL |
| | HBG000026 | 18 | 3157 | Eukaryotic translation initiation factor 2-alpha kinase 3 precursor, Eukaryotic translation initiati |
| | HBG065748 | 48 | 1238 | AT20832p,AT27361p, CG10513-PA, CG10514-PA, CG10550-PA, isoform A, CG10553-PA,CG10559-PA,CG10560-P |
The F-measures of the Data Sets
| Method | DS1 | DS2 | DS3 | DS4 |
| KM with k-tuple | 0.5738 | 0.7828 | 0.5543 | 0.6532 |
| SL with k-tuple | 0.3544 | 0.4148 | 0.3307 | 0.3244 |
| CL with k-tuple | 0.5153 | 0.7253 | 0.5588 | 0.516 |
| AL with k-tuple | 0.5113 | 0.6956 | 0.5578 | 0.3185 |
| BKM with k-tuple | 0.5725 | 0.7876 | 0.5498 | 0.6551 |
| mBKM with k-tuple | 0.5882 | 0.7913 | 0.5691 | 0.6722 |
| KM with DMk | 0.7 | 0.8261 | 0.7716 | 0.8284 |
| SL with DMk | 0.601 | 0.7948 | 0.8188 | 0.6535 |
| CL with DMk | 0.7172 | 0.9295 | 0.6868 | 0.7468 |
| AL with DMk | 0.7898 | 0.9365 | 0.6963 | 0.8498 |
| BKM with DMk | 0.7346 | 0.8511 | 0.8044 | 0.8813 |
| mBKM with DMk | 0.808 | 0.9645 | 0.9143 | 0.9587 |
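The F-measures above compare a clustering against known family labels. The sketch below shows one commonly used class-weighted definition of the clustering F-measure; whether the paper uses exactly this formula is an assumption, since this record does not state it. The function name `clustering_f_measure` and the toy labels are hypothetical.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_labels):
    """Class-weighted F-measure: for each true class take the best
    F-score over all clusters, then average weighted by class size."""
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_labels)
    overlap = Counter(zip(true_labels, cluster_labels))

    total = 0.0
    for c, n_c in class_sizes.items():
        best = 0.0
        for k, n_k in cluster_sizes.items():
            n_ck = overlap[(c, k)]
            if n_ck == 0:
                continue
            precision = n_ck / n_k  # share of cluster k that belongs to class c
            recall = n_ck / n_c     # share of class c that landed in cluster k
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_c / n) * best
    return total


# Toy labels, not taken from the data sets.
truth = ["a", "a", "b", "b"]
perfect = clustering_f_measure(truth, [0, 0, 1, 1])
merged = clustering_f_measure(truth, [0, 0, 0, 0])
print(perfect, merged)
```

Under this definition a value of 1 means every family is recovered as its own cluster, while merging distinct families into one cluster lowers precision and therefore the score.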
The similarity/dissimilarity matrix for the 10 full β-globin gene sequences based on DMk
| Species | Human | Goat | Opossum | Gallus | Lemur | Mouse | Rat | Gorilla | Bovine | Chimpanzee |
| Human | 0 | 22.95 | 37.65 | 111.47 | 14.02 | 35.21 | 20.68 | 3.42 | 25.07 | 3.54 |
| Goat | | 0 | 41.22 | 65.70 | 18.80 | 35.05 | 33.93 | 32.36 | 6.04 | 33.05 |
| Opossum | | | 0 | 42.54 | 33.29 | 64.03 | 51.64 | 46.35 | 40.41 | 49.73 |
| Gallus | | | | 0 | 90.93 | 80.07 | 95.26 | 121.09 | 61.69 | 122.65 |
| Lemur | | | | | 0 | 21.39 | 18.50 | 17.19 | 18.12 | 18.74 |
| Mouse | | | | | | 0 | 16.04 | 33.64 | 27.60 | 37.59 |
| Rat | | | | | | | 0 | 17.69 | 30.53 | 20.58 |
| Gorilla | | | | | | | | 0 | 33.66 | 0.80 |
| Bovine | | | | | | | | | 0 | 35.46 |
| Chimpanzee | | | | | | | | | | 0 |
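One simple way to read the matrix above is to look up each species' nearest neighbour. The sketch below transcribes the upper-triangular values from the table, assuming the columns follow the same species order as the rows, mirrors them into a symmetric matrix, and reports the closest species for each. The variable and function names are illustrative, and this lookup is not the tree-building procedure used for the figures.

```python
species = ["Human", "Goat", "Opossum", "Gallus", "Lemur",
           "Mouse", "Rat", "Gorilla", "Bovine", "Chimpanzee"]

# Row i holds the distances from species[i] to species[i+1:], as in the table.
upper = [
    [22.95, 37.65, 111.47, 14.02, 35.21, 20.68, 3.42, 25.07, 3.54],
    [41.22, 65.70, 18.80, 35.05, 33.93, 32.36, 6.04, 33.05],
    [42.54, 33.29, 64.03, 51.64, 46.35, 40.41, 49.73],
    [90.93, 80.07, 95.26, 121.09, 61.69, 122.65],
    [21.39, 18.50, 17.19, 18.12, 18.74],
    [16.04, 33.64, 27.60, 37.59],
    [17.69, 30.53, 20.58],
    [33.66, 0.80],
    [35.46],
]

n = len(species)
dist = [[0.0] * n for _ in range(n)]
for i, row in enumerate(upper):
    for offset, d in enumerate(row):
        j = i + 1 + offset
        dist[i][j] = dist[j][i] = d  # the matrix is symmetric


def nearest(i):
    """Return (name, distance) of the species closest to species[i]."""
    j = min((j for j in range(n) if j != i), key=lambda j: dist[i][j])
    return species[j], dist[i][j]


nearest_neighbours = {species[i]: nearest(i) for i in range(n)}
for name, (other, d) in nearest_neighbours.items():
    print(f"{name}: {other} ({d})")
```

With these values the smallest entry is Gorilla and Chimpanzee (0.80), with Human close to both, while Gallus is far from every other species.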
Figure 1. The phylogenetic trees for 10 species using the full DNA sequences of β-globin.
Figure 2. The distribution of F-measure as a function of the number of clusters based on the k-tuple distance (the true numbers of clusters in DS1, DS2, DS3 and DS4 are 8, 6, 6, and 6, respectively).
Figure 3. The distribution of F-measure as a function of the number of clusters based on DMk (the true numbers of clusters in DS1, DS2, DS3 and DS4 are 8, 6, 6, and 6, respectively).
Clustering results on the data sets listed in Table 1
| Data | F-measure | Time(s) | F-measure | Time(s) | F-measure | Time(s) |
| DS1 | 0.8080 | 6.875 | 0.4525 | 48 | 0.2713 | 39.8 |
| DS2 | 0.9645 | 1.844 | 0.7515 | 13.6 | 0.5924 | 6.4 |
| DS3 | 0.9143 | 2.375 | 0.3693 | 12.7 | 0.3157 | 17.1 |
| DS4 | 0.9587 | 1.328 | 0.5224 | 9.3 | 0.4007 | 6.8 |
(Time includes both the similarity computation and the clustering.)
Figure 4. The phylogenetic trees for 10 species using the full DNA sequences of β-globin.
Figure 5. The phylogenetic trees for 60 H1N1 viruses.
Figure 6. The time comparison of three methods.
Figure 7. The relationship between the runtime and different numbers of sequences and length of sequences.