Literature DB >> 22917190

A modified statistically optimal null filter method for recognizing protein-coding regions.

Lei Zhang¹, Fengchun Tian, Shiyuan Wang.

Abstract

Computer-aided protein-coding gene prediction in uncharacterized genomic DNA sequences is one of the most important issues of biological signal processing. A modified filter method based on a statistically optimal null filter (SONF) theory is proposed for recognizing protein-coding regions. The square deviation gain (SDG) between the input and output of the model is used to identify the coding regions. The effective SDG amplification model with Class I and Class II enhancement is designed to suppress the non-coding regions. Also, an evaluation algorithm has been used to compare the modified model with most gene prediction methods currently available in terms of sensitivity, specificity and precision. The performance for identification of protein-coding regions has been evaluated at the nucleotide level using benchmark datasets and 91.4%, 96%, 93.7% were obtained for sensitivity, specificity and precision, respectively. These results suggest that the proposed model is potentially useful in gene finding field, which can help recognize protein-coding regions with higher precision and speed than present algorithms.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2012 PMID： 22917190 PMCID： PMC5054498 DOI： 10.1016/j.gpb.2012.02.001

Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN： 1672-0229 Impact factor: 7.691

Introduction

Recognition of protein-coding regions has attracted much attention in recent years. Currently, different kinds of methods for locating protein-coding regions have been proposed. Conventional techniques for recognizing exons of DNA sequences include intelligent methods based on neural networks [1], hidden Markov models (HMMs) based on statistical theory [2], [3], [4], and correlation function methods [5]. Markov chain based models perform well in gene-findings for genome sequence analysis [6] and a better Markov model, which relies on a number of training gene datasets for accurate model parameters such as the first, second and fifth-order Markov models, has been well developed in comparison with other algorithms using Z-curve [7]. However, the computational speed of such models is cost-ineffective in the training process. Besides, the prior information is overconsidered in modeling. Thus, further development of more convenient and simple algorithms with acceptable accuracy is beneficial to genome sequence studies especially in the investigation of eukaryote genomes. In recent years, signal processing approaches have attracted significant attention in research of genomic sequences and genome structures, which may identify hidden periodicity and features that cannot be revealed easily by conventional statistical methods. In DNA sequences, protein-coding regions typically show a periodic character of three bases, which cannot be found in intergenic regions and introns in eukaryotes. Previous digital signal processing methods were based on the property of period-3 and include the discrete Fourier transform (DFT) [8], [9], [10], [11], short-time discrete Fourier transform (STDFT) with a sliding window [12], lengthen-shuffle DFT based on the format of the Z-curve [13] and EPND method with DNA walk sequences [14]. In addition, the band-pass digital filters, which are centered at , have also been proposed to predict protein-coding regions. These include single infinite impulse response (IIR) anti-notch filter using lattice structure [15] and multistage filters to suppress the background 1/f noise [16]. Guigo divides gene prediction methods into model-dependent and model-independent methods [17]. Model-dependent methods depend on a priori known genomic information of organisms, while model-independent methods do not. Model-independent methods, such as modified Garbor transform (MGT) [18], DSP methods [19], [20], extended Kalman filters based on symbolic dynamics [21], and a time-frequency filtering technique based on S-transform [22], can be used to identify unknown protein-coding regions of DNA sequences. However, most existing algorithms may be useful in the recognition of DNA sequences which are longer but the accuracy of recognition may be affected for shorter sequences. Recently, an exon detection algorithm using statistically optimal null filters (SONFs) has been proposed in comparison with the DFT algorithm for shorter sequences [23] and showed its feasibility in gene prediction for shorter DNA sequences. SONF, which is closely related to the Kalman filter, reduced the modeling complexity without requiring the solution of nonlinear equations of the Ricatti type which is essential in computing the gain of the Kalman filter [24]. SONF has been widely used in the seizure detection field [25]. Effectiveness and lower complexity of computing with SONF inspire us to explore SONF in-depth for detecting more favorable characteristics of genomes. For clarity, the outline of this paper has been shown as follows. First, we apply Z-curve representation to map DNA sequences into digital sequences. Second, we illustrate the basic principle of the improved model and the square deviation (SD) gain (SDG) amplification method was used to suppress the non-coding signal which is viewed as 1/f noise in this paper. The complete recursive iteration algorithm is also described, and the global procedure frame is given. Then, we describe the benchmark gene datasets, and present the prediction of coding regions with Class I and Class II amplification using the proposed algorithm. Furthermore, an evaluation measurement is performed for comparison with other gene prediction methods using the F56F11.4 sequence, HMR195 datasets and human β-globin gene, respectively. Finally, a conclusion of this paper is presented. This paper addresses the challenges in the locations of longer DNA sequences and shorter DNA sequences, respectively, using SONF without any training datasets, which is different from the Markov chain based models that require DNA sequence length and their a-priori biological information.

Methods

The improved model and structure are used to detect protein-coding regions. Also, a SDG amplification method is applied to suppress the non-coding regions. The recursive algorithm is illustrated and the global program for implementation is presented in detail. A block diagram of gene prediction model is shown in Figure 1.

Figure 1

The schematic data flow block diagram of the proposed gene prediction model structure

Digital mapping of DNA sequence

DNA sequence digitalization is the first stage in genome analysis. We describe a three-dimensional curve representation called the Z-curve to reconstruct each base [26]. Considering a DNA sequence with N bases, we calculate the cumulative numbers of the bases A, C, G and T, respectively beginning from the 1st base to the n-th base. We then obtain four positive integers A, C, G and T. The Z-curve is constructed by a group of nodes P (n = 1, 2, …, N), whose coordinates are illustrated by the following x, y, and z [26]:where the initial values A0 = C0 = G0 = T0 = 0 and x0 = y0 = z0 = 0. Also, we define thatThus, we know that a DNA sequence can be decomposed into three digital sequences (consisting of 1 or −1), which represents the distribution of purine/pyrimidine type, amino/keto type and strong/weak hydrogen bonds type along the DNA sequences, respectively [13]. To detect protein-coding region, a sliding window with a width of M samples is applied in our model, where M should be determined by the maximum exon length in protein-coding regions. To obtain a new digital DNA sequence, the window is then moved by one base overlapping for every sample interval until all the bases are embedded in the window.

Modified filtering

The basic theory of instantaneous matched filter (IMF) has been presented previously [27] and a general description of the SONF, which combines the maximum output signal-to-noise ratio (SNR) with the minimum mean square error (MMSE) criteria, is also given [24]. The modified filter method designed based on IMF and SONF is shown as follows. Consider a DNA sequence with a length of N shown as follows:where n(k) is zero-mean white Gaussian noise, and the desired signal d(k) is represented as:where V is the random variable, is the known basis function, and v(k) is the output of IMF which can be expressed as:For the least optimization of IMF output, we scale v(k) by , then the desired signal estimate can be shown by:where is white noise, and the ultimate form of d(k) is illustrated as:To determine the optimal filter, we scale the output v(k) of IMF by an unknown function , the final output y(k) of IMF is presented as:where y(k) is also called the estimate of d(k), and the output error of the filter is illustrated as:By using (3) the output error can also been represented by:For an ideal null filter, , and the error function of the filter becomes:Consider the MSE criteria, with respect to the input SNR, the optimal post-IMF scaling function can be written as:where q(k) is shown as follows:Thus, the power of the input noise should be small enough (i.e. ), then the scaling function is rewritten as:The detailed structure of the model is illustrated in Figure 2.

Figure 2

The structure of the improved filter model The part with dashed line denotes the block of instantaneous matched filter (IMF).

SDG amplification model

Non-coding regions may obscure the coding regions in prediction such that, the border information of coding regions cannot be identified accurately [28]. To suppress the non-coding regions (1/f noise), an SDG amplification method is proposed to enhance the SDG of coding regions which is recognized as our feature object of the coding regions. A related suppressing method has been introduced in which a quadratic window operation is performed [28]. The window can effectively suppress the non-coding regions while preserving the coding regions so that the coding regions can be easily recognized. However, a different window length is needed for different DNA sequences. To calculate the SDG of DNA sequence segment, we first design a SDG function as the weight scales of original output which is similar to the signal boosting method [29]. We define the SDG of coding regions as followswhere η is a smaller positive value (i.e., η = 0.2 ± 0.05) which is equal to smooth coefficient to control the sensitivity of algorithm. The SDG of non-coding regions is illustrated as follows:where should be slightly greater than 1 to control the attenuation velocity of noise level. Therefore, the final object value after Class I amplification is described as:where the gain function , G is the original SDG of coding regions. The operation on the original filter results denotes the Class I amplification, while the Class II amplification denotes the same operation on the Class I amplification results so that more signal of non-coding region will be suppressed.

Implementation of recursive algorithm

Before the implementation, we first define SD of signal X as:where E(X) denotes the expected value of signal X. The SDG between the input and output of iteration algorithm, , is recognized as the resultant object of the detection of coding regions which is defined as: Considering the period-3 property of protein-coding regions, we determine the dimensions of model as follows: The basis functions with a desired period-3 property primarily perform a good forecast property [23]. In this paper, we have considered several different selections for the parameter c of (c is an uncertain constant, i = 1, 2, 3) for generality. The three orthogonal basis functions have been illustrated as follows: Combining the introduced theory [24], the complete recursive algorithm has been presented as follows: The recursive update formula for the gain matrix in this paper originated from the matrix lemma shown below:where . The gain matrix P(k) is initialized as matrix I and v(k) is initialized as v(0) = (0, 0, 0) for iterations; where I is the identity matrix. Combining our DNA representation of Z-curve and the SDG amplification operation with the recursive algorithm, we summarize the brief global iteration implementation program for locating protein coding regions as shown in Figure 3.

Figure 3

Implementation diagram of the modified algorithm

Results and discussion

Gene datasets

To evaluate the performance of the improved filter model on the detection of protein-coding regions, we apply the iteration implementation program to the gene F56F11.4 on the Caenorhabditis elegans chromosome III which contains five known coding exons in positions 928–1039, 2528–2857, 4114–4377, 5465–5644, and 7255–7605 (GenBank accession number AF099922 [21]. In this work, one benchmark dataset from the mammalian organism HMR195 dataset has also been considered. HMR195 is a dataset of 195 sequences with exactly one complete either single-exon or multi-exon gene. HMR195 has the following characteristics: (1) the ratio of human:mouse:rat sequences is 103:82:10, (2) the mean length of the sequences in the set is 7096 bp, (3) the number of single-exon genes is 43, and the number of multi-exon genes is 152, (4) the average number of exons per gene is 4.86, (5) the mean exon length is 208 bp, the mean intron length is 678 bp and the mean coding length of a gene is 1015 bp (∼330 amino acids), and (6) the proportion of coding sequence in this dataset is 14%, of the intronic sequence 46% and of the intergenic DNA 40%.

Application of the modified model to gene datasets

The proposed algorithm is applied to three gene sequences with known fragments of exons, respectively. First, for the long sequence F56F11.4, the sliding window width is set as M = 351 according to the maximum exon length and the sliding window step is 1 bp. Figure 4 illustrates the prediction performance of the model combined with SDG algorithm under the condition that parameter c equals to 1 and 0.5, respectively. The locations of peak values are recognized as the predicted exon areas using our method. The peak values show that the SDG is larger in coding regions. It is consistent with the theory that the SNR between the coding regions (signal) and the non-coding regions (noise) is large [28]. When comparing Figure 4A with B, we observe that the plots become smoother with the decreasing of parameter c which is very similar to the smooth filter.

Figure 4

Identification of coding regions on F56F11.4 The output SDG of model for c = 1 and c = 0.5 was shown in A and B, respectively. The binary dot lines illustrate the true coding exons regions for visualization. The vertical axis shows the SDG, and the horizontal axis shows the relative base location.

Figure 5 illustrate the SDG after Class I and Class II amplification, respectively. We can see that the non-coding regions are suppressed effectively, and the accurate border locations of coding regions are visible. From Figure 6, we can see that the non-coding regions almost tend to zero after Class II amplification. Therefore, we can say that the SDG amplification method is effective in recognition of coding regions.

Figure 5

Identification of coding regions on F56F11.4 with Class I (A) and Class II (B) amplification

Figure 6

Recognition of the No. 5 mammalian sequence in HMR195 dataset after Class II enhancement using the improved model The gray regions denote the relative physical positions of CDS features.

Figure 6 illustrates the performance of the extracted 1100 bps from the No. 5 mammalian sequence in datasets HMR195. The true coding regions (57-117, 554-595, 706-839 and 1022-1067) have been shown intuitively with CDS feature in GenBank.

Model evaluation

To evaluate the validity of the proposed recognition model in C. elegans, a modified evaluation scheme is performed on the basis of the previous publications [18], [30]. A threshold th percent smaller than the SDG is viewed as the non-coding regions, and set to zero similarly as described previously [18]. The threshold value th is in the range between 1 and 99 to predict the borders of coding regions for calculating sensitivity (Sn), specificity (Sp) and the precision (P). In this paper, the best threshold th is 83. Figure 7 illustrates the nucleotide-level measures of prediction borders. The black blocks are the actual regions, and the gray blocks are predicted exon regions.

Figure 7

Evaluation of prediction accuracy at nucleotide level The black blocks represent the actual coding regions, and the gray blocks represent the predicted exonic regions. TP: true positive; FP: false positive; TN: true negative; FN: false negative.

The formulae are shown as follows To test the HMR195 data sets, correlation coefficient (CC) and approximate correlation (AC) are introduced which have been defined as True positive (TP) is the number of coding nucleotides correctly predicted as coding regions. False negative (FN) is the number of coding nucleotides predicted as non-coding regions. True negative (TN) is the number of non-coding nucleotides correctly predicted as non-coding regions. False positive (FP) is the number of non-coding nucleotides predicted as coding regions [30]. In the evaluation performance, we compared the DFT method [8], anti-notch filter and multistage filter [16], the original SONF algorithm [23] and the modified wavelet technique [18]. The results of the comparison with the method reported previously [24] are shown in Table 1. Note that the listed data of DFT, anti-notch filter, multistage filter and the signal boosting based on DFT are obtained from previous study [29] using the same evaluation method. From the table, we observe that the percentages of prediction of the proposed model are obviously superior to other DSP methods. Maximum sensitivities of 0.721, 0.703, 0.673, 0.725 and 0.88 were obtained using the DFT technique, IIR anti-notch filter, multistage filter, signal boosting based method and modified Garbor-wavelet, respectively. Compared with the original SONF algorithm, the Sn is slightly higher while the Sp and precision P are enhanced significantly except for the time frequency method based on S-transform which can achieve an accuracy of 96%. In addition, the parameters value in this paper is slightly lower than the Markov-based model with high orders. However, it cannot outweigh the advantages of this proposed method. The SDG amplification can also effectively improve the accuracy of recognition.

Table 1

Evaluation performance (in %) of different methods for the C. elegans chromosome III

Gene prediction methods	Sn	Sp	P	References
aModified model	91.4	96.0	93.7	This study
SONF model with c = 0.5	90.0	76.9	83.5	[23]
SONF with c = 1	90.0	51.7	70.8	[23]
DFT technique	72.1	39.4	89.7	Table 2 in [29]
IIR anti-notch filter	70.3	35.1	89.4	Table 2 in [29]
Multistage filter	67.3	26.6	88.5	Table 2 in [29]
Signal boosting based on DFT	72.5	47.1	91.1	Table 2 in [29]
Modified Garbor-wavelet	88.0	90.0	91.5	Table 1 in [18]
Time frequency method	88.0	98.0	96.0	Table 2 in [22]
Lengthen-shuffling FFT	78.8	79.9	79.3	Table 4 in [7]
Markov model k = 1	78.4	81.4	79.9	Table 4 in [7]
Markov model k = 2	85.4	94.5	89.9	Table 4 in [7]
Markov model k = 4	91.9	95.6	93.8	Table 4 in [7]
Markov model k = 5	92.6	95.8	94.2	Table 4 in [7]

Note:

Modified SONF model after Class II enhancement; data for Lengthen-shuffling FFT and Markov models with k = 1, 2, 4 and 5 are the best conditions where P = (Sn + Sp)/2 is used in evaluation. It is worthy noting that the values in bold face denote the superior recognitions. Sn, sensitivity; Sp, specificity; P, precision.

Also, the improved model has been used to test the HMR195 datasets in comparison with GeneMark, HMM, and FGENES programs [31]. Table 2 lists the index parameters including Sn, Sp, P, AC, and CC analyzed using the evaluation scheme. The parameters in this table are the average values of every sequence in these datasets. From Table 2, the precision of HMMgen is 93.0%, while a precision of 90.7% is obtained in this paper. We should point out that the results for existing methods were obtained from the corresponding publications. Different methods have been tested on different gene datasets, the repetitive work of the existent methods were enormous and redundant. We simply use the results from the references. Therefore, two tables have been presented for two gene datasets.

Table 2

Exon levels of HMR195 datasets from different gene finding programs

Programs of gene finding	Sn (%)	Sp (%)	CC (%)	AC (%)	P (%)
Filter model	91.7	87.8	77.9	80.3	90.7
GeneMark.HMM	87.0	89.0	83.0	84.0	88.0
HMMgene	93.0	93.0	91.0	91.0	93.0
FGENES	86.0	88.0	83.0	84.0	87.0
Genie	91.0	90.0	88.0	89.0	90.5
Morgan	75.0	74.0	69.0	70.0	74.5

Note: Data for GeneMark.HMM, HMMgene, FGENES, Genie, and Morgan were obtained from Table 1[31]. For the HMR195 datasets, the HMMgene performs the best for recognition. It is worthy noting that the values in bold face denote the superior recognitions. CC, correlation coefficient; AC, approximate correlation.

The filter model based signal processing method require neither additional biological information or trained genomic datasets for prediction of coding regions, so it can be applied to analyze unknown and novel genomes. This paper focused on identification of long DNA sequences. From the simulations (see Tables and Figures), we find out that the proposed algorithm in this paper can effectively recognize the locations of coding regions. Although not as good as that of the high order Markov model, the results obtained using the proposed algorithm are acceptable. It is worth noting that the complexity of this filter model is lower than the high order Markov-based model (see the implementation program). In addition, the model in this paper shares some similarities with Kalman filter theory. However, the computing complexity of the modified algorithm is efficiently reduced without calculating the Jacoby matrix by using partial differential and modeling the status equations. Moreover, compared with the DFT and STDFT spectrum analysis, and time frequency methods [8], [9], [10], [11], [12], [13], [14], [22], the window width M of the sliding window in this study does not require a multiple of 3 due to the power calculation of S(N/3). This paper aimed at investigating the validity and feasibility of the proposed model in genome analysis and prediction. To some extent, the Markov chain based model may be more effective in predicting coding regions by comparison with Ref. [4]. Markov model parameters were trained via a number of gene datasets, thus filter based methods cannot achieve its prediction ability and robustness, and the Markov chain model has widespread applications in many technical fields (e.g., the practicable software on line). Although the prediction accuracy using the proposed model is slightly lower than that using the efficient Markov chain model with high orders, the filter model is superior to the prediction methods based on the frequency content (e.g., DFT based techniques, IIR anti-notch filter and multistage filter) in terms of sensitivity, specificity and precision. We show that the improved filter model has reliable performance for exon prediction. We conducted the first independent comparative evaluation of the gene-finding algorithms available and designed a more convenient and simple algorithm for a broad approach to gene finding. Obtaining definitive accuracy seems to be an impossible task, since the performance of the programs is very sensitive to the datasets tested upon, as observed by many researchers. Not to mention that we have to assume that the actual coding exons were correctly annotated in the GenBank record under the “CDS” feature (annotated non-coding exons are not considered).

Conclusion

In this paper, a modified filter model is applied to detect protein-coding regions. To analyze gene sequences using signal processing theory, Z-curve representation is used to map DNA bases into digital sequences. Combined with the filter and the iteration algorithm, a sliding window is then applied for sampling gene data in order to analyze DNA sequences for predicting coding regions. We illustrated the potential use of the filter model to recognize a known DNA sequence. Our results strengthen its plausibility in detection of protein-coding regions. An advantage of the filter model is that it performs well without the limit of window width. In addition, the complex computation can be skipped without considering the Jacoby matrix. Another advantage is that the proposed filter model can achieve the identification of coding regions without any prior information about the DNA sequences. To suppress the non-coding regions and enhance the SNR between coding regions and non-coding regions, a SDG amplification model with Class I and Class II amplification is carried out on the output of the filter model. Simulation results show that the coding regions can be clearly identified. An evaluation algorithm is then performed on the two models. Results show that the improved filter model in this paper is effective in predicting protein-coding regions. However, the SNR is supposed to be infinite in this new model and the gene datasets tested in this paper are from the gene library and thus can be thought as pure signal. Certain improvements and adjustments of the filter structure and more tests on noised gene data are desired for potential applications to the genome analysis including genome prediction and signal processing.

Authors’ contributions

LZ conceived and carried out the project, and drafted the manuscript. FT supervised the study. SW participated in the design and coordination. All the authors have read and approved the final manuscript.

Competing interests

The authors have declared that no competing interests exist.

16 in total

1. Performance Improvement of the Goertzel Algorithm in Estimating of Protein Coding Regions Using Modified Anti-notch Filter and Linear Predictive Coding Model.

Authors: Mahsa Saffari Farsani; Masoud Reza Aghabozorgi Sahhaf; Vahid Abootalebi
Journal: J Med Signals Sens Date: 2016 Jul-Sep

1 in total

A modified statistically optimal null filter method for recognizing protein-coding regions.

Introduction

Methods

Digital mapping of DNA sequence

Modified filtering

SDG amplification model

Implementation of recursive algorithm

Results and discussion

Gene datasets

Application of the modified model to gene datasets

Model evaluation

Conclusion

Authors’ contributions

Competing interests

1. Frequency-domain analysis of biomolecular sequences.

2. Evaluation of gene-finding programs on mammalian sequences.

3. Determination of eukaryotic protein coding regions using neural networks and information theory.

4. Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence.

5. Identification of protein coding regions using the modified Gabor-wavelet transform.

6. A generalized hidden Markov model for the recognition of human genes in DNA.

7. Finding genes in DNA with a Hidden Markov Model.

8. Z curves, an intutive tool for visualizing and analyzing the DNA sequences.

9. A new improved model-based seizure detection using statistically optimal null filter.

10. Gene finding in novel genomes.

1. Performance Improvement of the Goertzel Algorithm in Estimating of Protein Coding Regions Using Modified Anti-notch Filter and Linear Predictive Coding Model.