
iProEP: A Computational Predictor for Predicting Promoter.

Hong-Yan Lai¹, Zhao-Yue Zhang¹, Zhen-Dong Su¹, Wei Su¹, Hui Ding¹, Wei Chen², Hao Lin³.

Abstract

A promoter is a fundamental DNA element located around the transcription start site (TSS) that regulates gene transcription. Promoter recognition is important for determining transcription units, studying gene structure, analyzing gene regulation mechanisms, and annotating gene function. Many models have been proposed to predict promoters, but their performance still needs to be improved. In this work, we combined pseudo k-tuple nucleotide composition (PseKNC) with the position-correlation scoring function (PCSF) to formulate promoter sequences of Homo sapiens (H. sapiens), Drosophila melanogaster (D. melanogaster), Caenorhabditis elegans (C. elegans), Bacillus subtilis (B. subtilis), and Escherichia coli (E. coli). The Minimum Redundancy Maximum Relevance (mRMR) algorithm and an increment feature selection strategy were then adopted to find the optimal feature subsets. A support vector machine (SVM) was used to distinguish promoters from non-promoters. In the 10-fold cross-validation test, accuracies of 93.3%, 93.9%, 95.7%, 95.2%, and 93.1% were obtained for H. sapiens, D. melanogaster, C. elegans, B. subtilis, and E. coli, with areas under the receiver operating characteristic curve (AUCs) of 0.974, 0.975, 0.981, 0.988, and 0.976, respectively. Comparative results demonstrated that our method outperforms existing methods for identifying promoters. An online web server was established and can be freely accessed (http://lin-group.cn/server/iProEP/).


Keywords: feature selection; position-correlation scoring function; promoter; pseudo k-tuple nucleotide composition; web server

Year: 2019 PMID: 31299595 PMCID: PMC6616480 DOI: 10.1016/j.omtn.2019.05.028

Source DB: PubMed Journal: Mol Ther Nucleic Acids

Introduction

In a genome, promoters are important regions of DNA located near the transcription start sites (TSSs) of genes. They are nucleotide sequences extending from dozens to hundreds of base pairs upstream and downstream of the TSS. They serve as regulatory elements for the assembly of the transcription machinery, in particular by binding RNA polymerase to promote accurate initiation of transcription. Additionally, evidence has shown that promoters play crucial roles in the regulation of gene expression, including alternative splicing, transcript stability, mRNA localization, and translation. Identifying the promoters of a gene is an important part of recognizing the gene’s complete structure. Hence, mapping promoters to the genome is usually the first step in unraveling the mechanisms of transcriptional and expression regulation, and research on promoter prediction deserves to be pushed forward. The DNA elements in promoters differ between eukaryotes and prokaryotes. In eukaryotes, most protein-coding genes and the genes for some small nuclear RNAs have binding sites for RNA polymerase II. The core region of RNA polymerase II-dependent promoters usually contains several regulatory units: the TATA element, which is located 25 bp upstream of the TSS; the initiator; and the downstream promoter element (DPE), usually located 30 bp downstream of the TSS.
In prokaryotes, most genes are regulated by the σ70 promoter, which contains three basic elements: the Pribnow box with the consensus sequence 5′-TATAAT-3′ located 10 bp upstream of the TSS, the −35 region with the consensus sequence 5′-TTGACA-3′ located 35 bp upstream of the TSS, and the initiator adjacent to the TSS.5, 6 Distinct gene-regulatory mechanisms and sequence compositions among species prompt us to use different methods to identify promoters in their genomes.7, 8 With the development of high-throughput sequencing technology, an increasing number of genomes need to be annotated. However, characterizing promoters experimentally is costly, laborious, and time consuming, which has motivated the development of computational methods for promoter identification. There have been many attempts to predict promoters in different species. Some models were based on the principle of sequence similarity, and others converted the original sequences into numeric vectors and then adopted machine learning approaches to perform recognition. The latter extracted features according to various promoter properties, such as CpG content, free energy, consensus sequence, and global descriptors, and built prediction programs based on machine learning approaches such as Fisher’s linear discriminant, decision trees, support vector machines (SVMs), Hidden Markov Models, neural networks, and pattern-based nearest neighbor search. Recently, deep learning has been used to capture complex promoter sequence characteristics16, 17 and to address related bioinformatics identification problems.18, 19, 20, 21, 22 Although existing algorithms have exhibited encouraging performance, most of those predictors focused on only one species, and there is still room to improve prediction performance. In this study, following the steps shown in Figure 1, we developed an effective computational promoter prediction program for eukaryotic and prokaryotic species.
We first collected promoter and non-promoter sequences from five species to construct reliable benchmark datasets. The features extracted from the primary sequences were then filtered by a feature selection technique according to their ability to distinguish promoters from non-promoters. Subsequently, the optimal features were input into the SVM to train, test, and build models. Finally, based on the proposed model, we established a user-friendly web server, iProEP, which can be freely accessed at http://lin-group.cn/server/iProEP/.

Figure 1

A Flowchart to Outline the Promoter Prediction Program Construction

Results

Optimization of Three PseKNC-Related Parameters

As indicated in the PseKNC section (see Materials and Methods), three parameters, k, λ, and ω, must be determined when using PseKNC to formulate promoters and non-promoters. In PseKNC, k and λ describe the short-range and long-range sequence-order effects, respectively, and ω is the weight factor that adjusts the ratio of the two effects. In this work, the optimal values of the three parameters for the five species were obtained by searching ranges comprising 5 values of k, 30 values of λ, and 10 values of ω. For each species, the performances of the resulting 1,500 (5 × 30 × 10) combinations of the three parameters were examined to find the combination that produced the best accuracy. Thus, we constructed 1,500 SVM classifiers for each species and evaluated each by 5-fold cross-validation. The optimal combinations of the three parameters for the five species are reported in Table 1.

Table 1

The Optimal Values of Three PseKNC Parameters for Five Species

Kingdom	Species	k	λ	ω	ACC (%)
Eukaryotes	H. sapiens	4	24	0.1	90.9
	D. melanogaster	5	9	0.1	89.5
	C. elegans	4	22	0.1	81.4
Prokaryotes	B. subtilis	4	12	0.2	83.8
Prokaryotes	E. coli	4	12	0.1	80.7


The Ultimate Five Promoter Classifiers

By combining PseKNC with the position-correlation scoring function (PCSF), promoter and non-promoter samples can be formulated by features of dimension $4^k + 6λ + n$. Of the $4^k + 6λ$ PseKNC features, the $4^k$ components reflect the DNA short-range correlation information, and the 6λ components describe the long-range correlation information. The position information is characterized by the n-dimensional PCSF (see Materials and Methods). When these features are incorporated into a prediction model, redundant information or noise might degrade the performance of the model. Therefore, Minimum Redundancy Maximum Relevance (mRMR) combined with the increment feature selection (IFS) process was adopted to eliminate such uninformative features and improve the accuracy and robustness of the promoter recognition models. Ultimately, by constructing a large number of SVM-based models and comparing their performance using 5-fold cross-validation, the optimal feature subsets for the five species were identified; they are shown in Table 2. The accuracies improved after the noise features were removed. The feature dimensions for C. elegans, B. subtilis, and E. coli decreased dramatically after feature selection, whereas only 13 and 204 features were excluded for H. sapiens and D. melanogaster, respectively. A possible reason is that the promoter sequences of H. sapiens and D. melanogaster are much more complex than those of the other three species.

Table 2

The Feature Numbers and Accuracies for Five Species before and after mRMR Feature Selection

Kingdom	Species	Original Feature Number	Original ACC (%)	Optimal Feature Number	Optimal ACC (%)
Eukaryotes	H. sapiens	423	93.4	410	93.5
	D. melanogaster	1,097	93.3	893	93.8
	C. elegans	405	94.4	65	95.6
Prokaryotes	B. subtilis	345	94.0	55	95.5
Prokaryotes	E. coli	345	92.1	44	93.2
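
The original feature counts in Table 2 can be checked against the $4^k + 6λ + n$ formula using the k and λ values in Table 1. The following is a minimal sketch of that arithmetic; the function names are illustrative, and the number of PCSF sites n per species is not stated in the text, so it is only inferred here as the remainder.

```python
# (k, lambda) from Table 1 and the original feature number from Table 2.
PARAMS = {
    "H. sapiens": (4, 24, 423),
    "D. melanogaster": (5, 9, 1097),
    "C. elegans": (4, 22, 405),
    "B. subtilis": (4, 12, 345),
    "E. coli": (4, 12, 345),
}
N_PROPERTIES = 6  # six dinucleotide structural properties (see PseKNC section)


def pseknc_dimension(k, lam, n_properties=N_PROPERTIES):
    """Dimension of the type II PseKNC vector: 4^k + n_properties * lambda."""
    return 4 ** k + n_properties * lam


def implied_pcsf_sites(k, lam, total):
    """PCSF dimension n inferred as total minus the PseKNC dimension."""
    return total - pseknc_dimension(k, lam)


for species, (k, lam, total) in PARAMS.items():
    print(species, pseknc_dimension(k, lam), implied_pcsf_sites(k, lam, total))
```

Under this reading, the PseKNC part contributes 400 of the 423 H. sapiens features, and the remaining 23 would correspond to the PCSF sites.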

After determining the optimal feature subsets, for convenience in subsequent comparisons, 10-fold cross-validation was applied to seek the best SVM-related parameters (c and γ) and to evaluate the models. For H. sapiens, D. melanogaster, C. elegans, B. subtilis, and E. coli, the optimal values of c and γ are $2$ and $2^{-3}$, $2$ and $2^{-2}$, $2^{5}$ and $2^{-1}$, $2^{5}$ and $2^{-7}$, and $2^{-1}$ and $2^{-1}$, respectively. The detailed results are listed in Table 3. In addition, receiver operating characteristic (ROC) curves are plotted in Figure 2 to show visually how well our model discriminates between promoters and non-promoters.

Table 3

The Results for Five Species by Using 10-Fold Cross-Validation

Kingdom	Species	ACC (%)	Sn (%)	Sp (%)	AUC
Eukaryotes	H. sapiens	93.3	92.3	92.7	0.974
	D. melanogaster	93.9	92.6	92.6	0.975
	C. elegans	95.7	95.0	94.4	0.981
Prokaryotes	B. subtilis	95.2	94.8	94.3	0.988
Prokaryotes	E. coli	93.1	92.2	91.2	0.976

Figure 2

Evaluating the iProEP by Using ROC Curve

ROC curves for promoter prediction in (A) H. sapiens, (B) D. melanogaster, (C) C. elegans, (D) B. subtilis, and (E) E. coli.


Comparison with Existing Promoter Classifiers

Comparison with existing methods is an important way to highlight the merits of a proposed model. Several computational methods have been developed for eukaryotic and prokaryotic promoter prediction.17, 23 To provide a fair comparison on the same data, only the method called IPMD was used for comparison, because the same benchmark datasets and the same cross-validation rule were used in both works. Furthermore, the comparison in that paper demonstrated that IPMD is superior to other existing predictors, such as NNPP2.2 and McPromoter. IPMD is a hybrid method that combines PCSF and increment of diversity (ID) with the modified Mahalanobis Discriminant. Figure 3 shows the results obtained by our proposed method and by IPMD. The results show that our model is superior to the IPMD model, especially for C. elegans, B. subtilis, and E. coli.

Figure 3

The Comparison between Our Proposed Method with IPMD Classifiers in 10-Fold Cross-Validation

Moreover, multi-window Z-curve and PseZNC have been proposed as feature extraction approaches for σ70 promoter prediction in E. coli. Based on the same E. coli data, multi-window Z-curve was re-evaluated in Lin et al. Its overall accuracy is only 77.81%, with an area under the receiver operating characteristic curve (AUC) of 0.8480, both lower than those of our proposed method. PseZNC is a feature extraction technique that combines multi-window Z-curve with PseKNC. The accuracy of the PseZNC-based method is also lower than that of our method. A detailed comparison is shown in Figure 4. Z-curve theory has been successfully applied in prokaryotic gene prediction because of the period-3 characteristic of codons. However, promoter sequences do not code for amino acids and do not obey the codon rule, which is why the two Z-curve-based methods cannot produce better results in promoter prediction.

Figure 4

The Prediction Results of Four Methods on the Same E. Coli σ70 Promoter Data

Recently, two predictors called iPromoter-2L and MULTiPly were also designed for E. coli promoter prediction. Only a rough comparison is possible, because the benchmark data in these studies were all derived from RegulonDB but are not identical. Both predictors provide multi-layer prediction for recognizing promoters and their subtypes. The former was based on multi-window-based PseKNC and Random Forest and produced an accuracy (ACC), sensitivity (Sn), and specificity (Sp) of 81.68%, 79.20%, and 84.16%, respectively. The latter obtained 86.92%, 87.27%, and 86.57% for the same three indexes with an SVM-based model. Our proposed model yielded ACC, Sn, and Sp of 93.1%, 92.2%, and 91.2%, respectively (Table 3), which are superior to those of the two predictors.

Cross-Species Evaluation

Cross-species evaluation was performed to assess the generalization ability of the proposed method. Because sequence structure, composition, and regulatory mechanisms differ between eukaryotes and prokaryotes, the experiments were performed only within each kingdom. We first evaluated the H. sapiens-based model on D. melanogaster and C. elegans data. Results (Table 4) showed that the accuracies are only 77.10% and 66.63% for the two test datasets. Subsequently, we investigated the prediction performance of the D. melanogaster-based model on H. sapiens and C. elegans data. Only 68.41% of H. sapiens sequences and 65.68% of C. elegans sequences were correctly identified. Finally, we performed similar examinations on the models from C. elegans, B. subtilis, and E. coli and obtained similar results. The unsatisfactory results are mainly due to the species specificity of promoter sequences.

Table 4

The Results for Cross-Species Examination

Kingdom	Model Training	Model Test	ACC (%)
Eukaryotes	H. sapiens	D. melanogaster	77.19
	H. sapiens	C. elegans	66.63
	D. melanogaster	H. sapiens	68.41
	D. melanogaster	C. elegans	65.68
	C. elegans	H. sapiens	66.57
	C. elegans	D. melanogaster	69.58
Prokaryotes	B. subtilis	E. coli	75.95
Prokaryotes	E. coli	B. subtilis	80.92


Web Server and Tutorial

A user-friendly and publicly accessible web server is convenient for researchers.29, 30, 31 Thus, based on our proposed method, we established a web server called iProEP, with which researchers can identify promoters by uploading DNA sequences. A step-by-step guide on how to use the web server is given as follows:

Step 1. Open the web address http://lin-group.cn/server/iProEP/ to see a brief summary of iProEP (Figure 5).

Figure 5

The Homepage of the iProEP Web Server

Available at http://lin-group.cn/server/iProEP/.

Step 2. Click on “Predictor” on the navigation bar, then choose a suitable species and enter the query DNA sequences into the input box for prediction. Note that the sequences must be in FASTA format with a length of >300 bp for eukaryotes and >81 bp for prokaryotes. Click on the “example” button below the input box to see a sample sequence in FASTA format.

Step 3. Click on the “submit” button to obtain the predicted result. If the sequence is longer than 300 bp (eukaryotes) or 81 bp (prokaryotes), the predictor will scan the sequence using a 300- or 81-bp window, respectively, with a step of 1 bp. The result for each subsequence will be displayed on the result page.
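
The window scan described in Step 3 can be sketched as follows. This is only an illustration of the scanning logic, not the server's implementation; the function name and the choice to accept a sequence of exactly the window length are assumptions.

```python
# Window sizes stated in the text: 300 bp for eukaryotes, 81 bp for prokaryotes.
WINDOW = {"eukaryote": 300, "prokaryote": 81}


def scan_windows(sequence, kingdom, step=1):
    """Yield (start, subsequence) for every window of the kingdom-specific size.

    Assumption: a sequence of exactly the window length yields one window.
    """
    size = WINDOW[kingdom]
    seq = sequence.upper()
    if len(seq) < size:
        raise ValueError(f"sequence shorter than the {size}-bp window")
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]


# A 100-bp prokaryotic query gives 100 - 81 + 1 = 20 overlapping windows.
query = "ACGT" * 25
windows = list(scan_windows(query, "prokaryote"))
print(len(windows))
```

Each subsequence would then be passed to the trained classifier, so a query of length L produces L − 300 + 1 or L − 81 + 1 predictions.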

Discussion

Computational identification of promoters has attracted attention for many years, and many encouraging results have been obtained. However, it remains a challenging topic in bioinformatics. In this work, we proposed a new feature extraction technique that combines PseKNC with PCSF to improve prediction ACC. A series of examinations demonstrated that our proposed method can distinguish promoter from non-promoter sequences with good performance. We therefore established the predictor iProEP for the convenience of researchers. In future work, promoters from more species will be collected for species-specific promoter prediction.17, 32 Moreover, although the combination of PseKNC and PCSF worked well in this study, new feature extraction techniques should be developed to further improve the performance of promoter prediction. Finally, with the accumulation of more data and the development of deep learning techniques for many biological problems,17, 21, 33, 34, 35 deep learning is a suitable approach for identifying promoters.

Materials and Methods

Benchmark Dataset

A key step for constructing a powerful and robust prediction model is to construct an objective and strict benchmark dataset. In this work, we established five benchmark datasets including promoter and non-promoter sequences for five species (Table 5).

Table 5

Detailed Information on the Training Datasets for Five Species

Kingdom	Species	Promoter	Non-promoter (CDS)	Non-promoter (Non-CDS^a)	Location
Eukaryotes (300 bp)	H. sapiens	1,787	1,800	1,800	[−249, +50]
Eukaryotes (300 bp)	D. melanogaster	1,886	1,799	2,859	[−249, +50]
Eukaryotes (300 bp)	C. elegans	598	600	600	[−249, +50]
Prokaryotes (81 bp)	B. subtilis	270	300	300	[−60, +20]
Prokaryotes (81 bp)	E. coli	741	700	700	[−60, +20]

CDS, coding sequences.

^a Intron for eukaryotes and convergent intergenic region for prokaryotes.

The Eukaryotic Promoter Database (EPD) is a high-quality and non-redundant promoter resource and can be freely accessed at https://epd.epfl.ch//EPD_database.php. The 1,787 H. sapiens and 1,886 D. melanogaster Pol II promoter sequences were obtained from the EPD database. The 598 C. elegans promoter sequences were extracted from CEPDB (C. elegans promoter database; http://rulai.cshl.edu/cgi-bin/CEPDB/home.cgi). Each eukaryotic promoter is 300 bp long, spanning from 249 bp upstream to 50 bp downstream of the TSS (the TSS is regarded as the 0-th site). For prokaryotes, 270 B. subtilis σ43 promoters were collected from DBTBS (http://dbtbs.hgc.jp), and 741 E. coli K-12 σ70 promoter sequences were obtained from RegulonDB (http://regulondb.ccg.unam.mx/). All prokaryotic promoters are 81 nt long, covering the region from −60 to +20 around the TSS (the TSS is regarded as the 0-th site). The negative datasets were taken from the genome sequences of the five species. We randomly extracted 1,800 coding sequences and 1,800 introns from the human DNA sequences at http://www.fruitfly.org/sequence/human-datasets.html to generate the non-promoter dataset for H. sapiens. For D. melanogaster, a negative dataset including 2,859 coding sequences and 1,799 introns was downloaded from http://www.fruitfly.org/sequence/drosophila-datasets.html. The negative dataset of C. elegans contains 600 coding sequences and 600 introns randomly extracted from the Exon-Intron Database (EID). For prokaryotes, all negative samples were randomly taken from GenBank. The numbers of non-promoter sequences for B. subtilis and E. coli are 600 (300 coding sequences and 300 convergent intergenic sequences) and 1,400 (700 coding sequences and 700 convergent intergenic sequences), respectively.
To remove the influence of noisy data, we eliminated from both the positive and negative datasets any sequence containing IUPAC code letters other than A, C, G, and T, such as “N,” “S,” and “W.” To ensure that the format of the negative sequences matches that of the promoters, the eukaryotic and prokaryotic non-promoter sequences are also 300 and 81 bp long, respectively. The details of the benchmark datasets are listed in Table 5. It is well known that sequence similarity can influence the evaluation of a proposed model. We investigated the sequence similarity of the promoters of the five species by using CD-HIT. After setting the sequence identity cutoff to 0.8 to exclude highly similar promoters, we found that 98.0%, 99.3%, 95.0%, 98.5%, and 96.0% of the promoters for H. sapiens, D. melanogaster, C. elegans, B. subtilis, and E. coli remained, suggesting that the original datasets contain little redundancy and are suitable for constructing prediction models. Moreover, to provide an objective comparison with the previous promoter prediction method IPMD, the same benchmark datasets as used by IPMD are also provided. All data used in this study can be freely downloaded from http://lin-group.cn/server/iProEP/pages/download.php.
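
The filtering step described above can be sketched as below. This is a minimal illustration rather than the authors' pipeline; the function name is hypothetical, and checking the length in the same pass is an assumption made for convenience.

```python
VALID_BASES = set("ACGT")


def clean_dataset(sequences, length):
    """Keep sequences of the expected length that contain only A, C, G, T.

    Sequences with other IUPAC letters (N, S, W, ...) are discarded.
    """
    kept = []
    for seq in sequences:
        seq = seq.upper()
        if len(seq) == length and set(seq) <= VALID_BASES:
            kept.append(seq)
    return kept


# Toy example with 4-nt "sequences"; real inputs would be 300 or 81 nt long.
raw = ["ACGT", "ACGN", "acgt", "ACG", "AWGT"]
print(clean_dataset(raw, 4))
```

For the benchmark data, `length` would be 300 for the eukaryotic sets and 81 for the prokaryotic sets.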

Pseudo k-Tuple Nucleotide Composition (PseKNC)

In general, the input of almost all existing machine learning classification methods, such as SVM,44, 45, 46 Random Forest, and Artificial Neural Network,48, 49, 50 must be a numeric vector rather than a character string. Thus, each sample must be converted into a feature vector of fixed length. A simple and common strategy for transforming a DNA sample into a vector is to use its k-tuple nucleotide composition, which can be formulated as a vector with $4^k$ elements according to the following formula:

$$\mathbf{D} = [f_1, f_2, \ldots, f_i, \ldots, f_{4^k}]^{T} \quad (2)$$

where the symbol $T$ denotes the transposition of a vector, and $f_i$ is the normalized frequency of the i-th k-tuple nucleotide component occurring in the DNA sequence. In order to take both local and global sequence-order information of a DNA sequence into consideration, PseKNC51, 52 was proposed and has been widely utilized to represent DNA or RNA sequences.53, 54 Its basic principle is to combine the correlation of physicochemical properties of oligonucleotides with the k-mer composition to formulate DNA sequences. There are two kinds of PseKNC: type I and type II. The former, also called the parallel correlation type, mixes different physicochemical properties together to represent a nucleotide sequence with a vector containing $4^k + λ$ components. The latter, named the series correlation type, describes a nucleotide sequence by a vector containing $4^k + λΛ$ components.
Compared with type I PseKNC, which has been widely and successfully applied in various bioinformatics fields,8, 55 few works have focused on the application of type II PseKNC.54, 56 Considering the merit of type II PseKNC, namely that the different kinds of correlation information are kept separate, this work employed type II PseKNC to transform sample sequences into vectors as given below:

$$\mathbf{D} = [d_1, d_2, \ldots, d_{4^k}, d_{4^k+1}, \ldots, d_{4^k+\lambda\Lambda}]^{T} \quad (3)$$

$$d_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{4^k} f_i + \omega \sum_{j=1}^{\lambda\Lambda} \tau_j}, & 1 \le u \le 4^k \\[2ex] \dfrac{\omega\,\tau_{u-4^k}}{\sum_{i=1}^{4^k} f_i + \omega \sum_{j=1}^{\lambda\Lambda} \tau_j}, & 4^k < u \le 4^k + \lambda\Lambda \end{cases} \quad (4)$$

where $f_u$ has the same meaning as in Equation 2; λ is an integer less than L − k, which reflects the number of correlation tiers (correlation rank) along a DNA sequence of length L; ω is a weight factor used to balance the effect of global correlation information and local properties; and $\tau_j$ is a correlation factor, where the factors of tier m describe the sequence-order correlation between all the m-tier contiguous k-tuple nucleotides along a DNA sequence. Here $\tau_j$ can be calculated by

$$\tau_{(m-1)\Lambda+\xi} = \frac{1}{L-k-m} \sum_{i=1}^{L-k-m} J^{\xi}_{i,i+m}, \quad \xi = 1, 2, \ldots, \Lambda; \; m = 1, 2, \ldots, \lambda \quad (5)$$

where

$$J^{\xi}_{i,i+m} = P_{\xi}(R_iR_{i+1}) \cdot P_{\xi}(R_{i+m}R_{i+m+1}) \quad (6)$$

where $P_{\xi}(R_iR_{i+1})$ is the numerical value of the ξ-th physicochemical property for the dinucleotide $R_iR_{i+1}$ at position i, $P_{\xi}(R_{i+m}R_{i+m+1})$ is the corresponding value for the dinucleotide at position i + m, and Λ is the number of physicochemical properties. In this study, six DNA local structural properties of the 16 DNA dinucleotides were utilized (Λ = 6); the concrete values of three local translational properties (slide, shift, rise) and three local angular properties (roll, tilt, twist) were taken from Goñi et al.’s work. Note that the original values of the six DNA local structural properties must first be standardized by Equation 7 before they are used in Equation 6 to calculate PseKNC:

$$P_{\xi}(R_iR_{i+1}) = \frac{P^{0}_{\xi}(R_iR_{i+1}) - \langle P^{0}_{\xi}(R_iR_{i+1}) \rangle}{\mathrm{SD}\left(P^{0}_{\xi}(R_iR_{i+1})\right)} \quad (7)$$

where $P^{0}_{\xi}(R_iR_{i+1})$ is the original value of the ξ-th DNA local structural property for the dinucleotide at position i, the symbol $\langle \; \rangle$ means taking the average of the quantity therein over the 16 different combinations of A, G, C, and T for $R_iR_{i+1}$, and SD represents the corresponding standard deviation. The standardized versions of these physicochemical property values can also be found in many other DNA-related studies. The advantage of the 16 standardized values obtained from Equation 7 is that they have zero mean over the 16 different dinucleotides and remain unchanged if the same conversion procedure is applied again.
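
The construction described in this section can be sketched in a few functions. This is an illustrative sketch under stated assumptions, not the authors' code: the property tables below are toy placeholders (not the Goñi et al. values), the population standard deviation is assumed for the standardization, and each tier-m correlation factor is averaged over L − k − m dinucleotide pairs.

```python
from itertools import product
from statistics import mean, pstdev

DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]


def standardize(raw):
    """Standardize 16 dinucleotide values to zero mean and unit SD."""
    values = list(raw.values())
    mu, sd = mean(values), pstdev(values)  # population SD is an assumption
    return {d: (v - mu) / sd for d, v in raw.items()}


def kmer_frequencies(seq, k):
    """Normalized frequencies of all 4^k k-mers, in lexicographic order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1
    for i in range(total):
        counts[seq[i:i + k]] += 1
    return [counts[m] / total for m in kmers]


def correlation_factors(seq, props, lam, k):
    """Series-correlation factors: one per (tier m, property), tiers first."""
    taus = []
    for m in range(1, lam + 1):
        n_terms = len(seq) - k - m  # assumed number of averaged pairs
        for prop in props:
            total = sum(prop[seq[i:i + 2]] * prop[seq[i + m:i + m + 2]]
                        for i in range(n_terms))
            taus.append(total / n_terms)
    return taus


def pseknc_type2(seq, k, lam, w, props):
    """Type II PseKNC vector with 4^k + lam * len(props) components."""
    f = kmer_frequencies(seq, k)
    tau = correlation_factors(seq, props, lam, k)
    denom = sum(f) + w * sum(tau)
    return [x / denom for x in f] + [w * t / denom for t in tau]


# Toy placeholder property tables, NOT real structural parameters.
toy_raw = [
    {d: float(i) for i, d in enumerate(DINUCS)},
    {d: float((7 * i) % 16) for i, d in enumerate(DINUCS)},
]
toy_props = [standardize(r) for r in toy_raw]
vector = pseknc_type2("ACGTTGCAACGTAGCTAGGCTA", k=2, lam=3, w=0.1, props=toy_props)
print(len(vector))
```

With two toy properties, k = 2, and λ = 3, the vector has 16 + 3 × 2 = 22 components; with the six properties used in this study, the same code would give $4^k + 6λ$ components.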

PCSF

By aligning the promoter sequences of each species, we can construct a position-correlation scoring matrix (PCSM).24, 59 Each row of the PCSM consists of the factors $p_{x,i}$, the probability of k-mer x at the i-th site of the promoter samples. $p_{x,i}$ can be calculated by the following formula:

$$p_{x,i} = \frac{n_{x,i} + b_{x,i}}{N_i + B_i}$$

where $n_{x,i}$ is the actual count of x appearing at the i-th site, and $b_{x,i}$ is the corresponding pseudocount. $N_i$ indicates the sum of the real counts of all k-mers at the i-th site (namely, the number of positive samples), and $B_i$ is the corresponding sum of the pseudocounts. If the sample size is not large enough, some k-mers will not be observed as k increases. Hence, the pseudocount improves the estimate of the probability of k-mer x at the i-th site. $B_i$ and $b_{x,i}$ are given by

$$B_i = \sqrt{N_i}, \qquad b_{x,i} = p_0 B_i$$

in which $p_0$ is the background frequency of a k-mer, which is equal to $1/4^k$. As the sample number $N_i$ increases, the influence of the pseudocounts weakens because of the slow increase of $\sqrt{N_i}$. Conserved trimer sites for the five species were identified through extensive conservation analyses and ACC evaluations in Lin and Li. Based on these sites and the PCSM, the PCSF features of positive and negative samples for the five species can be expressed as

$$\mathbf{F} = [s_1, s_2, \ldots, s_n]^{T}$$

where n is the number of selected conserved sites, and each element is defined as

$$s_i = \log \frac{p_{x,i}}{p_0}$$

In this equation, $p_0$ is the background probability of each trimer ($p_0 = 1/4^3$), and $p_{x,i}$ can be obtained from the PCSM.
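
The PCSM and PCSF computation can be illustrated with a short sketch. It assumes the pseudocount scheme B = √N and b = p0·B, a natural-log score, and toy 6-nt sequences; the function names and the selected sites are hypothetical and do not reproduce the conserved sites used in this study.

```python
from itertools import product
from math import log, sqrt


def build_pcsm(promoters, k=3):
    """Return a list (one dict per site) of k-mer probabilities with pseudocounts."""
    n_seqs = len(promoters)
    p0 = 1 / 4 ** k              # background k-mer frequency
    big_b = sqrt(n_seqs)         # total pseudocount per site (assumed sqrt(N))
    b = big_b * p0               # pseudocount per k-mer
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    n_sites = len(promoters[0]) - k + 1
    pcsm = []
    for i in range(n_sites):
        counts = dict.fromkeys(kmers, 0)
        for seq in promoters:
            counts[seq[i:i + k]] += 1
        pcsm.append({m: (counts[m] + b) / (n_seqs + big_b) for m in kmers})
    return pcsm


def pcsf_features(seq, pcsm, sites, k=3):
    """Log-odds score of the k-mer observed at each selected site."""
    p0 = 1 / 4 ** k
    return [log(pcsm[i][seq[i:i + k]] / p0) for i in sites]


# Toy aligned "promoters"; real inputs would be 300- or 81-nt sequences.
promoters = ["TATAAT", "TATAAT", "TATACT", "TTTAAT"]
pcsm = build_pcsm(promoters)
print(pcsf_features("TATAAT", pcsm, sites=[0, 3]))
print(pcsf_features("GGGGGG", pcsm, sites=[0, 3]))
```

A k-mer that is frequent at a site in the promoter set scores above zero, whereas a k-mer never observed there scores below zero, which is what lets these features separate promoters from non-promoters.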

mRMR

Selecting the most useful features from high-dimensional data is commonly a required step to exclude noise, improve prediction ACC and efficiency, avoid overfitting, and build a robust model. In the present work, as the two variables k and λ in Equation 4 increase, the dimension of the PseKNC features rises sharply, which may result in the curse of dimensionality. Therefore, it is necessary to find the optimal features that produce a robust model with the highest ACC. mRMR is a popular feature selection technique that calculates a score for each feature to measure its importance.60, 61 It uses intuitive measures of relevance and redundancy to find a compact subset of the candidate features and has been widely used in data mining of biological processes.62, 63, 64, 65 For discrete features, two selection criteria, the Mutual Information Difference criterion (MID) and the Mutual Information Quotient criterion (MIQ), can be used to calculate the score of a feature. In this study, we chose the MIQ score. After scoring the PseKNC and PCSF features by mRMR, the IFS strategy with 5-fold cross-validation was applied to obtain the feature subset that produced the maximum prediction ACC. During the IFS procedure, the ranked features were added to the feature set one by one according to their mRMR rank, and at each step the performance of the top-ranked features selected so far was evaluated. The 5-fold cross-validation was also used to seek the best penalty coefficient c and width parameter γ for the SVM models when obtaining the best feature subset.54, 56
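
The ranking and selection procedure can be sketched as follows. This is a simplified greedy version of mRMR with the MIQ criterion for already-discretized features, followed by IFS; it is not the mRMR program used in this study. The `evaluate` callback is a hypothetical stand-in for the cross-validated SVM accuracy, and the small constant added to the redundancy only avoids division by zero.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Mutual information (bits) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(c / n * log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


def mrmr_miq(features, labels):
    """Greedy ranking by relevance divided by mean redundancy (MIQ)."""
    relevance = {name: mutual_information(col, labels)
                 for name, col in features.items()}
    ranked = [max(relevance, key=relevance.get)]
    remaining = set(features) - set(ranked)
    while remaining:
        def score(name):
            redundancy = sum(mutual_information(features[name], features[s])
                             for s in ranked) / len(ranked)
            return relevance[name] / (redundancy + 1e-12)
        best = max(sorted(remaining), key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked


def incremental_feature_selection(ranked, evaluate):
    """Evaluate the top-1, top-2, ... subsets and return the best one."""
    best_subset, best_score = [], float("-inf")
    for size in range(1, len(ranked) + 1):
        subset = ranked[:size]
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


# Toy data: the label is the majority vote of f1, f3, f4; f2 duplicates f1.
features = {
    "f1": [0, 0, 0, 0, 1, 1, 1, 1],
    "f2": [0, 0, 0, 0, 1, 1, 1, 1],
    "f3": [0, 0, 1, 1, 0, 0, 1, 1],
    "f4": [0, 1, 0, 1, 0, 1, 0, 1],
}
labels = [0, 0, 0, 1, 0, 1, 1, 1]
ranked = mrmr_miq(features, labels)
print(ranked)
```

In this toy case the duplicated feature is ranked last because it is fully redundant with `f1`. Continuous features such as PseKNC components would need to be discretized before mutual information is computed this way.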

SVM

SVM is a widely employed machine learning algorithm based on statistical learning theory and has been extensively applied in bioinformatics.67, 68, 69, 70, 71, 72, 73 The core idea of SVM is to find a classification hyperplane that maximizes the margin between the two classes in the feature space. LibSVM is a popular software package for running SVM and can be freely downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvm/. This study used LibSVM with the radial basis function (RBF) kernel to perform classification. We employed the grid search method with cross-validation to seek the best penalty coefficient c and width parameter γ. Both parameters were searched over a grid of powers of 2.
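
A grid search over powers of 2 can be sketched as below. The exponent ranges are placeholders chosen for illustration, not the ranges used in this study, and `cv_accuracy` is a hypothetical callback that would train an RBF-kernel SVM with the given c and γ and return its cross-validated accuracy.

```python
from itertools import product
from math import log2


def power_of_two_grid(min_exp, max_exp, step=1):
    """Return [2**min_exp, ..., 2**max_exp] with the given exponent step."""
    return [2.0 ** e for e in range(min_exp, max_exp + 1, step)]


def grid_search(c_values, gamma_values, cv_accuracy):
    """Return (best_accuracy, best_c, best_gamma) over the full grid."""
    best = None
    for c, gamma in product(c_values, gamma_values):
        acc = cv_accuracy(c, gamma)
        if best is None or acc > best[0]:
            best = (acc, c, gamma)
    return best


# Placeholder exponent ranges (assumptions, not the paper's search space).
c_grid = power_of_two_grid(-5, 15)
gamma_grid = power_of_two_grid(-15, 5)


def toy_accuracy(c, gamma):
    """Stand-in scorer that peaks at c = 2 and gamma = 2**-3."""
    return -abs(log2(c) - 1) - abs(log2(gamma) + 3)


print(grid_search(c_grid, gamma_grid, toy_accuracy))
```

With LibSVM's command-line tool `svm-train`, the corresponding options are `-t 2` for the RBF kernel, `-c` and `-g` for the two parameters, and `-v` for n-fold cross-validation.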

Performance Evaluation Metrics

To assess the quality of a predictor and compare different prediction tools, the following three indexes, namely, the overall ACC, Sn, and Sp, were used and formulated as

$$\mathrm{Sn} = \frac{TP}{TP + FN}, \qquad \mathrm{Sp} = \frac{TN}{TN + FP}, \qquad \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP (true positive) and TN (true negative) represent the numbers of correctly identified promoters and non-promoters, respectively, and FP (false positive) and FN (false negative) denote the number of non-promoters incorrectly classified as promoters and the number of promoters incorrectly classified as non-promoters, respectively. ROC analysis was used to measure the performance of the model as the decision threshold varies.
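
These metrics are straightforward to compute from the confusion counts. The sketch below also includes a pairwise (Mann-Whitney) estimate of the AUC, which is one common way to summarize a ROC curve and is assumed here for illustration; the function names and the example counts are made up.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, and overall accuracy from confusion counts."""
    return {
        "Sn": tp / (tp + fn),
        "Sp": tn / (tn + fp),
        "ACC": (tp + tn) / (tp + tn + fp + fn),
    }


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up counts: 100 promoters (90 found) and 100 non-promoters (80 rejected).
print(confusion_metrics(tp=90, tn=80, fp=20, fn=10))
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

Because ACC is a weighted average of Sn and Sp, it always lies between the two, which makes a quick sanity check for reported results.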

Author Contributions

H.D., W.C., and H.L. conceived and designed the study. H.-Y.L., Z.-Y.Z., Z.-D.S., and W.S. conducted the experiments. H.-Y.L. and Z.-Y.Z. implemented the algorithms. H.-Y.L., Z.-Y.Z., and Z.-D.S. established the web server. H.-Y.L., Z.-Y.Z., W.C., and H.L. performed the analysis and wrote the paper. All authors read and approved the final manuscript.

Conflicts of Interest

The authors declare no competing interests.
