Samuel S. Shepard, Andrew McSweeny, Gursel Serpen, Alexei Fedorov.
Abstract
Messenger RNA sequences possess specific nucleotide patterns distinguishing them from non-coding genomic sequences. In this study, we explore the utilization of modified Markov models to analyze sequences up to 44 bp, far beyond the 8-bp limit of conventional Markov models, for exon/intron discrimination. In order to analyze nucleotide sequences of this length, their information content is first reduced by conversion into shorter binary patterns via the application of numerous abstraction schemes. After the conversion of genomic sequences to binary strings, homogeneous Markov models trained on the binary sequences are used to discriminate between exons and introns. We term this approach the Binary Abstraction Markov Model (BAMM). High-quality abstraction schemes for exon/intron discrimination are selected using optimization algorithms on supercomputers. The best MM classifiers are then combined using support vector machines into a single classifier. With this approach, over 95% classification accuracy is achieved without taking reading frame into account. With further development, the BAMM approach can be applied to sequences lacking the genetic code such as ncRNAs and 5'-untranslated regions.
Year: 2012 PMID: 22344692 PMCID: PMC3367190 DOI: 10.1093/nar/gks154
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. The abstraction of nucleotides into binary sequences starting at abstraction level 0 (no abstraction) up to binary abstraction level 4 (BA4). A sliding window for the homogeneous MM algorithm shows the information being processed on the binary level as well as the effective nucleotide coverage.
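The abstraction step shown in Figure 1 can be illustrated with the "YR method" named in the table below: purines (A, G) map to one binary symbol and pyrimidines (C, T) to the other. A minimal sketch; the specific 1/0 assignment here is an assumption, not the authors' exact encoding.

```python
def yr_abstraction(seq):
    """Reduce a nucleotide string to a binary string: purine (R) -> '1',
    pyrimidine (Y) -> '0'. One of many possible abstraction schemes."""
    purines = {"A", "G"}
    return "".join("1" if nt in purines else "0" for nt in seq.upper())

print(yr_abstraction("GATTACA"))  # -> "1100101"
```

Each abstraction scheme trades away sequence detail for a longer effective window: an order-k model over a binary alphabet has 2^k contexts rather than 4^k, which is what lets the approach reach well beyond the 8-bp limit of conventional nucleotide Markov models.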
A diverse selection of abstraction schemes is shown with their original accuracies versus their SVM-optimized accuracies.
| Abstraction rule | Original %EX | Original %IN | Original M | SVM-optimized %EX | SVM-optimized %IN | SVM-optimized M |
|---|---|---|---|---|---|---|
| BA1-best | 76 | 78 | 0.77 | 93 | 72 | 0.79 |
| BA2-best | 75 | 85 | 0.80 | 93 | 79 | 0.84 |
| BA3-best | 77 | 87 | 0.81 | 94 | 81 | 0.86 |
| BA4-best | 79 | 88 | 0.83 | 95 | 82 | 0.87 |
|  | 74 | 71 | 0.72 | 90 | 78 | 0.83 |
| Pos. splicing | 71 | 82 | 0.76 | 94 | 72 | 0.80 |
| GT-rich (BA3) | 66 | 83 | 0.73 | 94 | 72 | 0.80 |
| Dupl. method | 76 | 85 | 0.80 | 94 | 76 | 0.83 |
| YR method | 79 | 66 | 0.72 | 94 | 70 | 0.78 |
| log2(AMI) | n/a | n/a | n/a | 70 | 89 | 0.78 |
| Nt. MM5 | 84 | 82 | 0.83 | 93 | 78 | 0.84 |
The SVM used a non-homogeneous polynomial kernel of degree 3 with normalization. The homogeneous Markov model of order 5 (Nt. MM5) and the log average mutual information (log2(AMI)) are also listed for comparison. Accuracies are listed as the percent of correctly predicted exons, introns or the M-value (which combines exon and intron accuracy, see ‘Methods’ section). Without SVM utilization, there was no pre-set decision boundary between introns and exons for AMI, making classification tests not applicable (n/a).
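The core BAMM step — a homogeneous order-k Markov model over binary strings, used as a log-likelihood-ratio classifier — can be sketched as follows. This is a simplified illustration under assumed details (Laplace smoothing, uniform fallback for unseen contexts); the paper's actual training and decision procedure may differ.

```python
from collections import defaultdict
import math

def train_binary_mm(sequences, k):
    """Estimate order-k transition probabilities from binary training strings.
    counts[context] = [count of next '0', count of next '1'], Laplace-smoothed."""
    counts = defaultdict(lambda: [1.0, 1.0])
    for s in sequences:
        for i in range(len(s) - k):
            counts[s[i:i + k]][int(s[i + k])] += 1
    return {ctx: (c[0] / (c[0] + c[1]), c[1] / (c[0] + c[1]))
            for ctx, c in counts.items()}

def log_likelihood(model, s, k):
    """Log-probability of a binary string under an order-k model."""
    ll = 0.0
    for i in range(len(s) - k):
        probs = model.get(s[i:i + k], (0.5, 0.5))  # unseen context -> uniform
        ll += math.log(probs[int(s[i + k])])
    return ll

def classify(exon_model, intron_model, s, k):
    """Positive score -> exon-like, negative -> intron-like."""
    return log_likelihood(exon_model, s, k) - log_likelihood(intron_model, s, k)
```

In the paper, the raw scores of many such classifiers (one per abstraction scheme) are not used directly; they are passed to an SVM, which learns a better decision boundary than the zero-crossing of a single log-likelihood ratio.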
The prediction accuracy of all 10 classifiers combined under an SVM polynomial kernel of degree 3.
| Order of MM | %Exon | %Intron | M |
|---|---|---|---|
| 2 | 94.7 | 93.9 | 0.943 |
| 3 | 95.5 | 93.9 | 0.947 |
| 4 | 96.0 | 93.8 | 0.948 |
| 5 | 96.0 | 94.7 | 0.953 |
| 6 | 96.2 | 94.2 | 0.951 |
| 7 | 96.1 | 93.8 | 0.948 |
| 8 | 96.1 | 93.6 | 0.947 |
| 9 | 96.1 | 92.6 | 0.941 |
| 10 | 95.8 | 92.7 | 0.941 |
Accuracies are listed as the percent of correctly predicted exons, introns or M-value (which combines exon and intron accuracy, see ‘Methods’ section).