Dapeng Xiong, Fen Xiao, Li Liu, Kai Hu, Yanping Tan, Shunmin He, Xieping Gao.
Abstract
BACKGROUND: Horizontal gene transfer (HGT) is one of the major mechanisms contributing to microbial genome diversification. Many computational methods for finding horizontally transferred genes have been proposed over the past decades; however, none of them has yet provided a reliable detector. Existing parametric approaches either use a single compositional property in the detection process or simply combine the results obtained from each property separately. Because different properties capture different information, a single property cannot sufficiently represent the information encoded by gene sequences. In addition, the class imbalance problem in the datasets, which also causes large errors in gene detection, has not been considered by published methods. Here we developed an effective classifier system (Hgtident) that uses a support vector machine (SVM) and combines unusual properties effectively for HGT detection.
Year: 2012 PMID: 22905214 PMCID: PMC3419211 DOI: 10.1371/journal.pone.0043126
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Comparison of classification results obtained through 5-fold cross validation with respect to different feature subsets selected.
| Genome | Recall (all features) | Mean error (all features) | Recall (optimal feature subset) | Mean error (optimal feature subset) |
| | 80.62 | 18.67 | 86.79 | 13.50 |
| | 76.83 | 17.97 | 84.62 | 14.31 |
| | 67.32 | 27.31 | 77.27 | 21.92 |
| | 78.34 | 20.38 | 83.73 | 16.05 |
| | 74.52 | 19.17 | 80.66 | 16.10 |
| | 74.73 | 20.69 | 78.30 | 14.27 |
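The results above come from 5-fold cross validation. As a minimal sketch of how such an evaluation could be set up, the Python below deals sample indices into five folds while keeping the class ratio similar in each fold, and computes Recall for the positive (HGT) class. Stratification, the function names `stratified_folds` and `recall`, and the label encoding (1 = transferred gene) are assumptions made for illustration; the record does not say how the folds were built.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with similar class proportions.

    Hypothetical helper: the source only states that 5-fold cross
    validation was used, not how the folds were drawn.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def recall(y_true, y_pred, positive=1):
    """Fraction of true positives that were predicted as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


if __name__ == "__main__":
    labels = [1] * 10 + [0] * 90  # toy imbalanced dataset
    for fold in stratified_folds(labels):
        print(len(fold), sum(labels[i] for i in fold))
```

Each fold would serve once as the test set while the remaining four are used for training; the per-fold Recall values are then averaged.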
The optimal feature subsets for testing genomes.
| Feature | A | B | C | D | E | F |
| Karlin’s dinucleotide | Yes | Yes | Yes | |||
| Karlin’s codon bias | Yes | Yes | Yes | Yes | ||
| GC1–GC3 | Yes | |||||
| | Yes | Yes | Yes | Yes | ||
| | Yes | Yes | Yes | Yes | ||
| JS-N | Yes | Yes | Yes | |||
| JS-DN | Yes | Yes | Yes | Yes | Yes | |
| JS-CB | Yes | Yes | Yes | Yes | Yes | |
| 1-mer | Yes | |||||
| 2-mer | ||||||
| 3-mer | Yes | |||||
| 4-mer | Yes | |||||
| 5-mer | Yes | |||||
| 6-mer | Yes | |||||
| 7-mer | Yes |
A–F are E. coli K12, E. coli O157 Sakai, S. enterica Typhi CT18, S. enterica Paratyphi ATCC 9150, C. pneumoniae CWL029 and S. agalactiae 2603, respectively. “Yes” indicates that the corresponding feature is included in the optimal feature subset.
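To make the compositional features in the table more concrete, the following Python sketch computes k-mer relative frequencies for a DNA sequence and a Jensen-Shannon divergence between two such frequency profiles (for example, a gene versus its host genome). Reading the "JS" prefix as Jensen-Shannon divergence is an assumption, since the record does not define the feature names; the function names `kmer_frequencies` and `js_divergence` are likewise illustrative and are not taken from Hgtident.

```python
import math
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Relative frequency of every DNA k-mer over overlapping windows."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing symbols other than A/C/G/T are ignored.
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) of two profiles over the same keys."""
    def kl(a, b):
        # Terms with a[x] == 0 contribute nothing to the sum.
        return sum(a[x] * math.log2(a[x] / b[x]) for x in a if a[x] > 0)

    mix = {x: 0.5 * (p[x] + q[x]) for x in p}
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


if __name__ == "__main__":
    gene = kmer_frequencies("ATGGCGCGCTAA", 2)      # toy sequences
    genome = kmer_frequencies("ATGAAATTTAAATAG", 2)
    print(round(js_divergence(gene, genome), 3))
```

With base-2 logarithms the divergence lies between 0 (identical profiles) and 1 (profiles with no k-mer in common), so a larger value would mark a gene whose composition is atypical for the genome.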
Comparison of classification results obtained through class imbalance learning method with the optimal feature subsets by 5-fold cross validation.
| Genome | Recall (none, imbalanced dataset) | Mean error (none, imbalanced dataset) | Recall (SMOTE, balanced dataset) | Mean error (SMOTE, balanced dataset) |
| | 86.79 | 13.50 | 92.35 | 7.26 |
| | 84.62 | 14.31 | 89.85 | 9.64 |
| | 77.27 | 21.92 | 86.17 | 11.36 |
| | 83.73 | 16.05 | 91.60 | 7.64 |
| | 80.66 | 16.10 | 83.51 | 13.70 |
| | 78.30 | 14.27 | 87.09 | 10.42 |
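The balanced datasets above were obtained with SMOTE, which oversamples the minority class by interpolating between a minority sample and one of its nearest minority neighbours. The Python below is a minimal standard-library sketch of that idea, not the implementation used for Hgtident; the function name `smote`, the neighbour count `k=5`, and the toy feature vectors are assumptions for illustration.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by neighbour interpolation.

    minority: list of equal-length numeric feature vectors.
    Hypothetical helper; parameter choices are illustrative only.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by distance to the base sample.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: sqdist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Place the new point at a random position on the segment
        # between the base sample and the chosen neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    hgt_genes = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for point in smote(hgt_genes, n_new=3, k=2):
        print(point)
```

In a cross-validation setting, oversampling would normally be applied to the training folds only, so that synthetic points never leak into the test fold.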
The classification results of the multiple-threshold approach and Hgtident.
| Method | Recall | Mean error | Recall | Mean error | Recall | Mean error | Recall | Mean error | Recall | Mean error | Recall | Mean error |
| A | 81.03 | 30.54 | 79.05 | 22.94 | 66.80 | 33.99 | 89.93 | 37.32 | 69.27 | 37.49 | 72.35 | 38.57 |
| B | 75.56 | 28.09 | 77.51 | 26.19 | 71.27 | 26.99 | 82.80 | 22.99 | 67.43 | 37.33 | 76.57 | 23.12 |
| C | 42.44 | 37.98 | 47.02 | 43.10 | 33.96 | 47.27 | 45.65 | 39.51 | 40.53 | 43.15 | 37.19 | 46.60 |
| D | 64.95 | 36.56 | 67.29 | 29.42 | 71.79 | 51.44 | 78.56 | 24.23 | 72.60 | 44.09 | 69.50 | 30.51 |
| E | 61.09 | 38.71 | 61.67 | 38.72 | 61.75 | 41.90 | 71.13 | 24.26 | 73.34 | 40.56 | 80.63 | 34.73 |
| F | | | 86.03 | 27.45 | | | 82.16 | 46.98 | 68.64 | 37.76 | 77.50 | 29.67 |
| G | 86.50 | 30.92 | 77.00 | 26.20 | 78.73 | 35.25 | 79.41 | 24.14 | 77.05 | 27.10 | 67.60 | 40.06 |
| H | 90.03 | 34.32 | | | 70.71 | 26.89 | | | | | | |
| I | 87.13 | 36.27 | 47.19 | 50.51 | 34.51 | 58.81 | 88.93 | 46.27 | 46.51 | 47.38 | 37.24 | 52.63 |
| J | 43.73 | 53.22 | 49.06 | 51.64 | 46.27 | 58.77 | 46.50 | 51.87 | 47.36 | 50.60 | 41.15 | 52.76 |
| K | 43.73 | 53.05 | 49.06 | 50.83 | 41.79 | 58.49 | 51.80 | 51.90 | 47.36 | 50.73 | 41.15 | 51.32 |
| L | 63.73 | 44.29 | 50.15 | 53.19 | 35.82 | 58.71 | 51.80 | 52.45 | 53.15 | 51.19 | 35.07 | 49.67 |
| M | 43.73 | 53.34 | 52.81 | 49.75 | 35.82 | 58.50 | 46.71 | 52.00 | 48.89 | 51.32 | 54.13 | 49.75 |
| N | 51.13 | 53.37 | 47.53 | 51.38 | 44.51 | 56.80 | 46.50 | 51.97 | 47.36 | 50.38 | 46.30 | 51.69 |
| O | 43.73 | 54.17 | 49.06 | 50.92 | 44.51 | 56.80 | 48.83 | 52.21 | 49.10 | 47.41 | 49.62 | 51.33 |
| Hgtident | 92.35 | 7.26 | 89.85 | 9.64 | 86.17 | 11.36 | 91.60 | 7.64 | 83.51 | 13.70 | 87.09 | 10.42 |
A–O are the multiple-threshold approaches based on Karlin’s dinucleotide, Karlin’s codon bias, GC1–GC3, dinucleotide, codon bias, JS-N, JS-DN, JS-CB and k-mer (k = 1, 2, …, 7), respectively. The best Recalls and the corresponding Mean errors obtained with the multiple-threshold approach are shown in bold.
Figure 1. Comparison between Hgtident and the multiple-threshold approach.
(A) Comparison of the highest Recalls. (B) Comparison of the Mean errors corresponding to the highest Recalls.