Lei Zhang, Fengchun Tian, Shiyuan Wang.
Abstract
Computer-aided prediction of protein-coding genes in uncharacterized genomic DNA sequences is one of the most important problems in biological signal processing. A modified filter method based on statistically optimal null filter (SONF) theory is proposed for recognizing protein-coding regions. The square deviation gain (SDG) between the input and output of the model is used to identify the coding regions. An SDG amplification model with Class I and Class II enhancement is designed to suppress the non-coding regions. An evaluation algorithm is also used to compare the modified model with most currently available gene prediction methods in terms of sensitivity, specificity and precision. Performance in identifying protein-coding regions was evaluated at the nucleotide level using benchmark datasets, yielding a sensitivity of 91.4%, a specificity of 96% and a precision of 93.7%. These results suggest that the proposed model is potentially useful in the gene-finding field and can help recognize protein-coding regions with higher precision and speed than present algorithms.
Year: 2012 PMID: 22917190 PMCID: PMC5054498 DOI: 10.1016/j.gpb.2012.02.001
Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691
Figure 1. The schematic data flow block diagram of the proposed gene prediction model structure.
Figure 2. The structure of the improved filter model. The part enclosed by the dashed line denotes the instantaneous matched filter (IMF) block.
Figure 3. Implementation diagram of the modified algorithm.
Figure 4. Identification of coding regions on F56F11.4. The output SDG of the model for c = 1 and c = 0.5 is shown in A and B, respectively. The binary dotted lines mark the true coding exon regions for visualization. The vertical axis shows the SDG, and the horizontal axis shows the relative base location.
Figure 5. Identification of coding regions on F56F11.4 with Class I (A) and Class II (B) amplification.
Figure 6. Recognition of the No. 5 mammalian sequence in the HMR195 dataset after Class II enhancement using the improved model. The gray regions denote the relative physical positions of CDS features.
Figure 7. Evaluation of prediction accuracy at the nucleotide level. The black blocks represent the actual coding regions, and the gray blocks represent the predicted exonic regions. TP: true positive; FP: false positive; TN: true negative; FN: false negative.
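The nucleotide-level counts named in the Figure 7 caption can be illustrated with a minimal sketch. It assumes that the actual and predicted annotations are given as equal-length 0/1 masks (1 = coding), and that sensitivity and specificity follow the convention common in gene-prediction benchmarks, Sn = TP/(TP + FN) and Sp = TP/(TP + FP). That convention is an assumption here, since the record does not spell out the formulas. The relation P = (Sn + Sp)/2 is taken from the table note. The function names and the toy masks are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN over two equal-length 0/1 masks (1 = coding)."""
    if len(actual) != len(predicted):
        raise ValueError("masks must have the same length")
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, fp, tn, fn


def sn_sp_p(tp, fp, fn):
    """Assumed convention: Sn = TP/(TP+FN), Sp = TP/(TP+FP); P = (Sn+Sp)/2.

    Raises ZeroDivisionError if there are no actual or no predicted
    coding positions.
    """
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    return sn, sp, (sn + sp) / 2


# Toy example (hypothetical data): one exon at positions 2-5,
# predicted one base too early.
actual = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
predicted = [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(actual, predicted)
sn, sp, p = sn_sp_p(tp, fp, fn)
print(tp, fp, tn, fn)  # 3 1 5 1
print(sn, sp, p)       # 0.75 0.75 0.75
```

In the toy case a one-base shift of the predicted exon costs one false positive and one false negative, so Sn and Sp drop by the same amount.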
Evaluation performance (in %) of different methods for the C. elegans chromosome III
| Gene prediction methods | Sn | Sp | P | References |
|---|---|---|---|---|
| This study | ||||
| SONF model with | 90.0 | 76.9 | 83.5 | |
| SONF with | 90.0 | 51.7 | 70.8 | |
| DFT technique | 72.1 | 39.4 | 89.7 | Table 2 in |
| IIR anti-notch filter | 70.3 | 35.1 | 89.4 | Table 2 in |
| Multistage filter | 67.3 | 26.6 | 88.5 | Table 2 in |
| Signal boosting based on DFT | 72.5 | 47.1 | 91.1 | Table 2 in |
| Modified Gabor-wavelet | 88.0 | 90.0 | 91.5 | Table 1 in |
| Time frequency method | | | | Table 2 in |
| Lengthen-shuffling FFT | 78.8 | 79.9 | 79.3 | Table 4 in |
| Markov model | 78.4 | 81.4 | 79.9 | Table 4 in |
| Markov model | 85.4 | 94.5 | 89.9 | Table 4 in |
| | | | | Table 4 in |
| | | | | Table 4 in |
Note: Modified SONF model after Class II enhancement. Data for Lengthen-shuffling FFT and Markov models with k = 1, 2, 4 and 5 are for the best conditions, where P = (Sn + Sp)/2 is used in evaluation. It is worth noting that values in bold face denote the superior recognitions. Sn, sensitivity; Sp, specificity; P, precision.
Exon levels of HMR195 datasets from different gene finding programs
| Programs of gene finding | Sn | Sp | CC | AC | P |
|---|---|---|---|---|---|
| GeneMark.HMM | 87.0 | 89.0 | 83.0 | 84.0 | 88.0 |
| FGENES | 86.0 | 88.0 | 83.0 | 84.0 | 87.0 |
| Genie | 91.0 | 90.0 | 88.0 | 89.0 | 90.5 |
| Morgan | 75.0 | 74.0 | 69.0 | 70.0 | 74.5 |
Note: Data for GeneMark.HMM, HMMgene, FGENES, Genie, and Morgan were obtained from Table 1 in [31]. For the HMR195 datasets, HMMgene performs best. It is worth noting that values in bold face denote the superior recognitions. CC, correlation coefficient; AC, approximate correlation.
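The two correlation measures named in the note can be sketched from nucleotide-level counts. The record itself does not give their formulas, so the sketch below assumes the definitions widely used in gene-finder benchmarking: CC as the Matthews-style correlation coefficient, and AC as twice the distance of the average conditional probability from 0.5. The function names and the example counts are hypothetical.

```python
import math


def correlation_coefficient(tp, fp, tn, fn):
    """Assumed definition: (TP*TN - FP*FN) / sqrt of the product of the four marginals.

    Undefined (division by zero) when any marginal sum is zero.
    """
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom


def approximate_correlation(tp, fp, tn, fn):
    """Assumed definition: AC = (ACP - 0.5) * 2, where ACP averages four ratios."""
    acp = 0.25 * (
        tp / (tp + fn)    # fraction of actual coding bases predicted coding
        + tp / (tp + fp)  # fraction of predicted coding bases that are coding
        + tn / (tn + fp)  # fraction of actual non-coding bases predicted non-coding
        + tn / (tn + fn)  # fraction of predicted non-coding bases that are non-coding
    )
    return (acp - 0.5) * 2


# Hypothetical counts: TP = 3, FP = 1, TN = 5, FN = 1.
cc = correlation_coefficient(3, 1, 5, 1)
ac = approximate_correlation(3, 1, 5, 1)
print(round(cc, 4), round(ac, 4))  # 0.5833 0.5833
```

AC is typically reported alongside CC because it stays defined in some cases where CC's denominator vanishes, provided the undefined ratios are dropped from the average. This sketch does not handle those cases.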