Chenglong Yu, Mo Deng, Lu Zheng, Rong Lucy He, Jie Yang, Stephen S-T Yau.
Abstract
Intron-containing and intronless genes have different biological properties and statistical characteristics. Here we propose a new computational method to distinguish between intron-containing and intronless gene sequences. Seven feature parameters α, β, γ, λ, θ, φ and σ, all based on detrended fluctuation analysis (DFA), are used to compute a 7-dimensional feature vector for any given gene sequence. A support vector machine (SVM) classifier with a Gaussian radial basis kernel function is then applied to this feature space to classify genes as intron-containing or intronless. We compare the performance of the proposed method with other state-of-the-art algorithms on biological datasets. The experimental results show that the new method significantly improves accuracy over those existing techniques.
Year: 2014 PMID: 25036549 PMCID: PMC4103774 DOI: 10.1371/journal.pone.0101363
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
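As a rough illustration of the DFA step named in the abstract, the sketch below computes a single DFA scaling exponent for a DNA sequence. It is generic first-order DFA, not the authors' seven-parameter construction, which the abstract does not spell out. The purine/pyrimidine ±1 encoding, the window sizes, and the function names (`dna_to_steps`, `linear_fit`, `dfa_fluctuation`, `dfa_exponent`) are all assumptions made for illustration.

```python
import math
import random

def dna_to_steps(seq):
    """Assumed encoding: purines (A, G) -> +1, everything else -> -1."""
    return [1.0 if base in "AG" else -1.0 for base in seq.upper()]

def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def dfa_fluctuation(steps, window):
    """RMS deviation of the integrated profile from per-window linear trends."""
    mean = sum(steps) / len(steps)
    profile, total = [], 0.0
    for s in steps:
        total += s - mean          # cumulative sum of the mean-centred series
        profile.append(total)
    n_windows = len(profile) // window
    xs = list(range(window))
    sq_sum = 0.0
    for w in range(n_windows):
        seg = profile[w * window:(w + 1) * window]
        slope, intercept = linear_fit(xs, seg)
        sq_sum += sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, seg))
    return math.sqrt(sq_sum / (n_windows * window))

def dfa_exponent(steps, windows=(8, 16, 32, 64)):
    """Slope of log F(n) against log n over the chosen window sizes."""
    log_n = [math.log(n) for n in windows]
    log_f = [math.log(dfa_fluctuation(steps, n)) for n in windows]
    slope, _ = linear_fit(log_n, log_f)
    return slope

# Synthetic random sequence, used only to exercise the code.
random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(2000))
exponent = dfa_exponent(dna_to_steps(seq))
print(f"DFA exponent: {exponent:.3f}")
```

For an uncorrelated random sequence like the synthetic one here, the exponent is expected to lie near 0.5; long-range correlated sequences give larger values.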
The seven feature parameters of 12 sample genes.
| Gene | α | β | γ | λ | θ | φ | σ |
| A00033 | 0.386 | 0.500 | 0.491 | 0.9933 | 0.9702 | 0.9415 | 0.2839 |
| A17677 | 0.508 | 0.544 | 0.565 | 1.0757 | 1.0307 | 0.8789 | 0.2588 |
| A11542 | 0.422 | 0.470 | 0.497 | 1.0431 | 1.0104 | 0.8742 | 0.2876 |
| A22239 | 0.465 | 0.495 | 0.480 | 1.0673 | 1.0353 | 0.9351 | 0.2951 |
| A24782 | 0.513 | 0.572 | 0.634 | 1.1773 | 1.1755 | 1.1214 | 0.3233 |
| Z31371 | 0.500 | 0.526 | 0.558 | 1.0233 | 1.0336 | 1.0200 | 0.2732 |
| M28289 | 0.576 | 0.635 | 0.632 | 1.3088 | 1.2991 | 1.0684 | 0.3462 |
| V01510 | 0.575 | 0.617 | 0.716 | 1.3347 | 1.2171 | 1.0992 | 0.3429 |
| U25810 | 0.560 | 0.657 | 0.623 | 1.2284 | 1.2740 | 1.1454 | 0.3340 |
| M13580 | 0.480 | 0.622 | 0.624 | 1.2339 | 1.2114 | 1.0537 | 0.3337 |
| U06674 | 0.507 | 0.498 | 0.555 | 1.1821 | 1.1609 | 0.9518 | 0.3274 |
| J02989 | 0.518 | 0.577 | 0.537 | 1.2192 | 1.2463 | 1.0363 | 0.3495 |
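The Gaussian radial basis kernel mentioned in the abstract measures the similarity of two 7-dimensional feature vectors. A minimal sketch using three rows of the table above follows; the kernel width `gamma = 1.0` is an arbitrary illustrative value, not a setting reported in the source, and `rbf_kernel` is a hypothetical helper name.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Feature vectors copied from the table rows for three sample genes.
a00033 = [0.386, 0.500, 0.491, 0.9933, 0.9702, 0.9415, 0.2839]
a11542 = [0.422, 0.470, 0.497, 1.0431, 1.0104, 0.8742, 0.2876]
m28289 = [0.576, 0.635, 0.632, 1.3088, 1.2991, 1.0684, 0.3462]

k_near = rbf_kernel(a00033, a11542)
k_far = rbf_kernel(a00033, m28289)
print(f"K(A00033, A11542) = {k_near:.4f}")
print(f"K(A00033, M28289) = {k_far:.4f}")
```

The kernel equals 1 for identical vectors and decays toward 0 as they move apart, so A00033 scores as more similar to A11542 than to M28289 under this assumed `gamma`.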
Figure 1. Linearity of log-log plots of three feature parameters, based on gene Z31371.
Figure 2. Linearity of log-log plots of three feature parameters, based on gene A10909.
Prediction results of different methods on 2000 mixed prokaryotic and eukaryotic genes (%).
| Methods | 1 | 2 | 3 | 4 | 5 | Average |
| GENSCAN | 76.50 | 74.00 | 76.75 | 78.25 | 77.50 | 76.60 |
| N-SCAN | 82.50 | 81.75 | 83.75 | 80.25 | 81.50 | 81.95 |
| Z-Curve | 88.75 | 87.25 | 85.25 | 83.75 | 85.75 | 86.15 |
| DFA7 | 94.75 | 93.50 | 92.75 | 91.75 | 92.50 | 93.05 |
Figure 3. Accuracy comparison of DFA7 and three other methods on 2000 mixed prokaryotic and eukaryotic genes.
Prediction results of different methods on 1000 eukaryotic genes.
| | DFA7 Method | Z-Curve Method |
| Average error counts on 800 training genes | 148.4 | 217.0 |
| Average error counts on 200 testing genes | 50.4 | 58.8 |
| Average error counts on total dataset | 198.8 | 275.8 |
| Average accuracy rate | (1000−198.8)/1000 = 80.12% | (1000−275.8)/1000 = 72.42% |
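The accuracy rate in the last row is derived from the average error counts: training and testing errors are added, and the total is subtracted from the dataset size. A small sketch of that arithmetic follows; the helper name `accuracy_rate` is illustrative.

```python
def accuracy_rate(total_genes, train_errors, test_errors):
    """Return (total errors, accuracy in percent) from average error counts."""
    total_errors = train_errors + test_errors
    accuracy = 100.0 * (total_genes - total_errors) / total_genes
    return total_errors, accuracy

# Average error counts (training, testing) copied from the table above.
methods = {"DFA7": (148.4, 50.4), "Z-Curve": (217.0, 58.8)}
results = {name: accuracy_rate(1000, tr, te) for name, (tr, te) in methods.items()}

for name, (errors, acc) in results.items():
    print(f"{name}: {errors:.1f} errors, {acc:.2f}% accuracy")
```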
Prediction results of GENSCAN on 1000 eukaryotic genes.
| | GENSCAN-Vertebrate | GENSCAN-Maize |
| Partition 1 | 31 | 80 |
| Partition 2 | 38 | 72 |
| Partition 3 | 31 | 78 |
| Partition 4 | 33 | 57 |
| Partition 5 | 42 | 87 |
| Total error counts | 175 | 374 |
| Average accuracy rate | (1000−175)/1000 = 82.50% | (1000−374)/1000 = 62.60% |
Prediction results of different methods on 1200 eukaryotic genes.
| | DFA7 Method | Z-Curve Method |
| Average error counts on 960 training genes | 204 | 266.2 |
| Average error counts on 240 testing genes | 62.6 | 74.4 |
| Average error counts on total dataset | 266.6 | 340.6 |
| Average accuracy rate | (1200−266.6)/1200 = 77.78% | (1200−340.6)/1200 = 71.62% |
Prediction results of 1000 eukaryotic genes based on DFA7 method by one-by-one feature deletion testing.
| | All 7 parameters | deleting α | deleting β | deleting γ | deleting λ | deleting θ | deleting φ | deleting σ |
| Average error counts on 500 training genes | 72.2 | 89.4 | 99.4 | 87.6 | 76.0 | 68.2 | 88.6 | 73.4 |
| Average error counts on 500 testing genes | 130.0 | 131.4 | 126.0 | 141.2 | 129.4 | 132.8 | 128.4 | 129.2 |
| Average error counts on total dataset | 202.2 | 220.8 | 225.4 | 228.8 | 205.4 | 201.0 | 217.0 | 202.6 |
| Average accuracy rate | 79.78% | 77.92% | 77.46% | 77.12% | 79.46% | 79.90% | 78.30% | 79.74% |
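One way to read the deletion test is to compare each column's total error count with the all-7-parameters baseline. The sketch below does this, referring to the seven deletion columns by position (1 to 7, left to right); the variable names and the ranking step are illustrative assumptions, not part of the source analysis.

```python
# Average error counts on the total dataset, copied from the table above.
baseline_errors = 202.2
deletion_errors = [220.8, 225.4, 228.8, 205.4, 201.0, 217.0, 202.6]

# Extra errors caused by removing each parameter (negative = fewer errors).
deltas = [round(e - baseline_errors, 1) for e in deletion_errors]

# Deletion columns ordered from largest to smallest increase in errors.
ranked = sorted(range(1, 8), key=lambda i: deltas[i - 1], reverse=True)

for i in ranked:
    print(f"deletion column {i}: {deltas[i - 1]:+.1f} errors")
```

A larger increase suggests that the deleted parameter contributed more on these partitions. Changes of about one error either way are small relative to the 1000-gene dataset and may not be meaningful.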
Error counts for 800 eukaryotic genes with 5 partitions on 3 different machine learning methods.
| | SVM7 | SVM3 | BPN3 | RBFN3 | RBFN7 |
| Partition 1 | 170 | 223 | 298 | 296 | 246 |
| Partition 2 | 178 | 211 | 319 | 287 | 279 |
| Partition 3 | 129 | 213 | 306 | 286 | 235 |
| Partition 4 | 151 | 216 | 315 | 274 | 232 |
| Partition 5 | 114 | 222 | 306 | 284 | 237 |
| Average error counts | 148.4 | 217.0 | 308.8 | 285.4 | 245.8 |
Error counts for 200 eukaryotic genes with 5 partitions on 3 different machine learning methods.
| | SVM7 | SVM3 | BPN3 | RBFN3 | RBFN7 |
| Partition 1 | 53 | 54 | 80 | 66 | 60 |
| Partition 2 | 46 | 61 | 75 | 69 | 56 |
| Partition 3 | 50 | 62 | 70 | 86 | 60 |
| Partition 4 | 44 | 59 | 78 | 73 | 62 |
| Partition 5 | 59 | 58 | 73 | 71 | 65 |
| Average error counts | 50.4 | 58.8 | 75.2 | 73.0 | 60.6 |
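The "Average error counts" rows in the two tables above are plain means over the five partitions. The sketch below recomputes them and combines them into an overall accuracy per classifier. It assumes the 800-gene and 200-gene tables are the training and testing parts of one 1000-gene experiment, which the source does not state outright. The assumption is suggested by the SVM7 and SVM3 averages matching the DFA7 and Z-Curve rows of the 1000-gene table. Variable names are illustrative.

```python
# Per-partition error counts copied from the two tables above.
train_errors = {
    "SVM7": [170, 178, 129, 151, 114],
    "SVM3": [223, 211, 213, 216, 222],
    "BPN3": [298, 319, 306, 315, 306],
    "RBFN3": [296, 287, 286, 274, 284],
    "RBFN7": [246, 279, 235, 232, 237],
}
test_errors = {
    "SVM7": [53, 46, 50, 44, 59],
    "SVM3": [54, 61, 62, 59, 58],
    "BPN3": [80, 75, 70, 78, 73],
    "RBFN3": [66, 69, 86, 73, 71],
    "RBFN7": [60, 56, 60, 62, 65],
}

def mean(values):
    return sum(values) / len(values)

TOTAL_GENES = 1000  # assumed: 800 training + 200 testing genes

summary = {}
for name in train_errors:
    avg_train = mean(train_errors[name])
    avg_test = mean(test_errors[name])
    accuracy = 100.0 * (TOTAL_GENES - (avg_train + avg_test)) / TOTAL_GENES
    summary[name] = (avg_train, avg_test, round(accuracy, 2))

for name, (avg_train, avg_test, accuracy) in summary.items():
    print(f"{name}: train {avg_train:.1f}, test {avg_test:.1f}, accuracy {accuracy:.2f}%")
```

Under that assumption, SVM7 comes out at 80.12% and SVM3 at 72.42%, the same figures given for DFA7 and Z-Curve in the 1000-gene table.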