
iPromoter-2L2.0: Identifying Promoters and Their Types by Combining Smoothing Cutting Window Algorithm and Sequence-Based Features.

Bin Liu, Kai Li.

Abstract

Promoters are short regions at specific locations of DNA sequences that play key roles in directing gene transcription. They can be grouped into six types (σ24, σ28, σ32, σ38, σ54, σ70). Recently, a predictor called "iPromoter-2L" was constructed to predict promoters and their six types, the first approach able to predict all six types. However, its predictive quality still needs to be further improved to meet real-world application requirements. In this study, we proposed the smoothing cutting window algorithm to find the window fragments of DNA sequences based on conservation scores so as to capture the sequence patterns of promoters. For each window fragment, discriminative features were extracted by using kmer and PseKNC. Combined with support vector machines (SVMs), different predictors were constructed and then clustered into several groups based on their distances. Finally, a new predictor called iPromoter-2L2.0 was constructed to identify promoters and their six types, developed by ensemble learning based on the key predictors selected from the cluster groups. The results showed that iPromoter-2L2.0 outperformed other existing methods for both promoter prediction and identification of the six types, indicating that iPromoter-2L2.0 will be helpful for genomics analysis.
Copyright © 2019 The Authors. Published by Elsevier Inc. All rights reserved.


Keywords:  ensemble learning; promoter; smoothing cutting window algorithm

Year:  2019        PMID: 31536883      PMCID: PMC6796744          DOI: 10.1016/j.omtn.2019.08.008

Source DB:  PubMed          Journal:  Mol Ther Nucleic Acids        ISSN: 2162-2531            Impact factor:   8.886


Introduction

A promoter is a DNA fragment at a specific location that can be recognized and bound by RNA polymerase to initiate transcription. In bacteria, the RNA polymerase core enzyme contains five subunits (2α, β, β′, ω) and an extra σ factor.1, 2 The σ factors can be labeled as σ24, σ28, σ32, σ38, σ54, and σ70 according to their molecular weights. Different σ factors direct the RNA polymerase to bind different promoter regions, which can affect the consequent activation of genes. σ24 and σ32 participate in the heat-shock response, σ28 participates in flagellar gene expression during normal growth, σ54 participates in nitrogen metabolism, and σ70, called the primary factor, is in charge of the transcription of most genes in growing cells.2, 3, 4 Because wet-lab experiments for identifying the types of promoters are expensive, several predictors have been developed to identify promoters based on DNA sequence information; for example, iPro54-PseKNC, based on the PseKNC, was constructed to identify promoters. A position-correlation scoring function (PCSF) and Bayes profiles were proposed to identify promoters. By combining the variable-window technique with the regular Z-curve method,9, 10, 11 the "variable-window Z-curve" was proposed to detect promoters. These methods were discussed in a recent study. Recently, iPromoter-2L was proposed, the first predictor able to predict promoters and their aforementioned six types. This predictor employed a multi-window-based PseKNC approach to capture the sequence patterns of promoters. However, for this predictor it is extremely hard to find optimized sequence windows with the flexible-sliding-window approach used to extract discriminative features, which prevents further performance improvement.
To overcome these shortcomings, in this study we proposed the smoothing cutting window (SCW) algorithm to divide the DNA sequences into fragment windows based on the conservation scores, and we ensembled different predictors based on various sequence-based features to further improve the predictive performance.

Results and Discussion

Comparison with Other Existing Methods

Table 1 shows the results (Equation 24) generated by iPromoter-2L2.0 via the 5-fold cross-validation on the benchmark dataset. The corresponding rates obtained by the existing methods are also given in Table 1. For the second-layer prediction, of the methods in Table 1, only iPromoter-2L and iPromoter-2L2.0 are able to predict the promoter types.
Table 1

A Comparison of iPromoter-2L2.0 with Other Predictors for Identifying Promoters (the First Layer) and Their Types (the Second Layer) via the 5-fold Cross-Validation on the Same Benchmark Dataset

Method             Acc (%)   MCC      Sn (%)   Sp (%)

First Layer

PCSF^a             74.81     0.4980   78.92    70.70
vw Z-curve^a       80.28     0.6098   77.76    82.80
Stability^a        78.04     0.5615   76.61    79.48
iPro54^a           80.45     0.6100   77.76    83.15
iPromoter-2L1.0^a  81.68     0.6343   79.20    84.16
iPromoter-2L2.0^b  84.98     0.6998   84.13    85.84

Second Layer

iPromoter-2L1.0^a
σ24 promoter       93.50     0.7338   72.52    96.93
σ28 promoter       96.82     0.5708   42.54    99.49
σ32 promoter       94.41     0.6524   52.58    99.14
σ38 promoter       94.69     0.2962   15.34    99.48
σ54 promoter       94.04     0.6459   53.19    99.57
σ70 promoter       80.66     0.6056   95.34    59.35

iPromoter-2L2.0^b
σ24 promoter       94.62     0.8053   81.82    97.22
σ28 promoter       97.94     0.7561   71.64    99.23
σ32 promoter       95.38     0.7361   71.82    98.05
σ38 promoter       94.58     0.2242   7.36     99.85
σ54 promoter       98.11     0.6714   59.57    99.42
σ70 promoter       85.94     0.7109   95.22    72.47

See Equation 1. Acc, accuracy; Sn, sensitivity; Sp, specificity.

a The results reported in Liu et al.

b The predictor proposed in this study.

From Table 1 we can see the following: (1) for the first-layer prediction, iPromoter-2L2.0 outperformed all the other methods in terms of all four performance measures (cf. Equation 24); (2) for the second-layer prediction, iPromoter-2L2.0 outperformed iPromoter-2L for the prediction of σ24, σ28, σ32, σ54, and σ70 promoters in terms of accuracy (Acc) and Matthew's correlation coefficient (MCC), and its performance is comparable with that of iPromoter-2L for the prediction of σ38 promoters. The reason for the performance improvement of iPromoter-2L2.0 is that it is based on the SCW algorithm, which is able to extract the sequence features more accurately to discriminate the promoters and their types. It can be anticipated that the proposed SCW algorithm will have many potential applications, such as enhancer prediction and DNA replication origin prediction.

Web Server and Its User Guide

We established a web server for iPromoter-2L2.0 to help readers use the proposed method by following the steps below. Step 1. Click the hyperlink http://bliulab.net/iPromoter-2L2.0/ to access the homepage, as shown in Figure 1. An introduction to the web server is given in the Read Me.
Figure 1

A Screenshot of the Homepage of the Web Server for iPromoter-2L2.0

iPromoter-2L2.0 can be accessed at http://bliulab.net/iPromoter-2L2.0/.

Step 2. Copy/paste or type the query DNA sequences into the input box at the center of Figure 1, or upload the data via the Browse button. Step 3. Click the Submit button to see the predicted results. If using the example sequences for the prediction, you will see the following results: (1) both the first and the second query sequences are non-promoters; (2) the third query sequence is a σ70 promoter. Step 4. On the results page, the predicted results can be downloaded by clicking the Download button.

Materials and Methods

Benchmark Dataset

To facilitate performance comparison among various methods, we employed the benchmark dataset S to construct the predictor and evaluate the performance of various methods, which can be formulated as

S = S⁺ ∪ S⁻

S⁺ = S_σ24 ∪ S_σ28 ∪ S_σ32 ∪ S_σ38 ∪ S_σ54 ∪ S_σ70

where "∪" indicates the "union" in set theory; S⁺ indicates the promoter samples; S⁻ indicates the non-promoter samples; and S_σ24, S_σ28, S_σ32, S_σ38, S_σ54, and S_σ70 indicate the six kinds of promoters. Specifically, the benchmark dataset consists of 5,920 samples, half of which are promoters and the others non-promoters. S_σ24 contains 484 samples; S_σ28 contains 134 samples; S_σ32 contains 291 samples; S_σ38 contains 163 samples; S_σ54 contains 94 samples; and S_σ70 contains 1,694 samples.

Sample Formulation

In this study, the DNA sequence samples were divided into several fragment windows by using the proposed SCW algorithm, and then for each fragment window, a sliding window approach was used to extract the sequence features by using kmer and PseKNC.6, 14, 15

SCW Algorithm

Previous studies showed that the distributions of conservation scores of promoters and non-promoters are obviously different. Here, we proposed the SCW algorithm to incorporate these sequence patterns into the predictor so as to improve the predictive performance. A DNA sample is represented as

D = N₁N₂N₃…N_L

where N_i denotes the nucleotide at sequence position i. It can be one of the following four nucleotides, i.e.,

N_i ∈ {A, C, G, T}

where "∈" refers to "member of," a symbol in set theory. To reflect the conservation score distribution patterns along D, it was split into S+1 fragments by the cutting points (S is the total number of cutting points), which can be represented as

D = F₁F₂…F_{S+1}

The cutting points were selected from the Z candidate cutting points (Z is the total number of candidate cutting points) under the constraint that any two selected points are separated by at least a distance threshold, which was set as 8 in this study. For a given sequence position i, whether it is a candidate cutting point is determined by SSD(i), where SSD(i) represents the smooth standard deviation of the average conservation score (CS) at sequence position i, obtained by smoothing SD(k) over the neighboring positions, where k is the sequence position and SD(k) is the standard deviation of the average CS at the k-th sequence position:

SD(k) = sqrt( (1/Y) Σ_{y=1}^{Y} (CS_y(k) − C̄S(k))² )

where Y represents the number of labels, which is equal to 2 for the first layer and 6 for the second layer; CS_y(k) denotes the y-th class samples' average CS at the k-th sequence position, which can be calculated by the approach introduced in Schneider and Stephens; and C̄S(k) is the average CS of all labels at the k-th position. The conservation profiles and the standard deviations of promoters and non-promoters are shown in Figure 2A, and the conservation profile and the standard deviation of each promoter type are shown in Figure 3A. The smooth standard deviation curves are shown in Figures 2B and 3B. The DNA sequences were divided into several fragments by SCW, as shown in Figures 2C and 3C. The pseudo-code of the SCW algorithm is shown in Box 1.
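The cutting-point selection described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: it uses a simple moving-average smoother and treats local minima of the smoothed curve as candidate cutting points, since the exact smoothing formula (Equation 7) and candidate definition (Equations 5 and 6) are not reproduced here; the function names are illustrative.

```python
import numpy as np

def smooth_sd(class_avg_cs, window=5):
    """SD(k) across the class-average conservation scores at each position,
    smoothed into SSD(i) with a moving average (the smoother is an assumption)."""
    sd = np.std(class_avg_cs, axis=0)            # SD(k), shape (L,)
    kernel = np.ones(window) / window
    return np.convolve(sd, kernel, mode="same")  # SSD(i)

def cutting_points(ssd, min_dist=8):
    """Greedy cutting-point selection: treat local minima of SSD as candidates
    (an assumption) and keep only those at least min_dist positions apart."""
    cand = [i for i in range(1, len(ssd) - 1)
            if ssd[i] <= ssd[i - 1] and ssd[i] <= ssd[i + 1]]
    chosen = []
    for c in cand:
        if all(abs(c - p) >= min_dist for p in chosen):
            chosen.append(c)
    return chosen
```

The distance threshold of 8 from the text appears here as `min_dist`; the rows of `class_avg_cs` would be the per-class average conservation scores (Y rows, one per label).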
Figure 2

A Flowchart Shows the Steps of the Proposed Smoothing Cutting Window Algorithm for the First-Layer Prediction

The standard deviations shown in (A) are converted into the smooth standard deviations as shown in (B), based on which the DNA sequences are divided into several fragments, as shown in (C).

Figure 3

A Flowchart Shows the Process of the Proposed Smoothing Cutting Window Algorithm for the Second-Layer Prediction

The SDs shown in (A) are converted into the smooth SDs as shown in (B), based on which the DNA sequences are divided into several fragments, as shown in (C).

Box 1. The Pseudo-Code of the SCW Algorithm

Parameters: sequence length L, number of labels Y
Input: DNA sequence in Equation 1
Output: cutting points
For y = 1 to Y do
  For i = 1 to L do
    Calculate the conservation score
  End for
End for
For i = 1 to L do
  Calculate SSD by Equation 7
End for
Calculate the cutting points by Equations 5 and 6 and SSD
Return the cutting points

After the process shown in Box 1, for the first-layer prediction each DNA sequence (cf. Equation 1) was divided into four fragments ([1, 28], [29, 44], [45, 56], [57, 81]), and for the second-layer prediction each DNA sequence was divided into four fragments ([1, 17], [18, 41], [42, 56], [57, 81]). Then, for each fragment, the sliding-window approach was used to extract the features. A sliding window can be expressed by (w, s), where w is the width of the window and s is the step of the sliding window. For each fragment obtained, the number N_i of segments produced by (w, s) along the fragment sequence is given by

N_i = INT((L_i − w)/s) + 1    (Equation 9)

where "INT" is an integer-cutting (floor) operator and L_i denotes the length of the i-th fragment. For example, assuming L_i = 29, w = 6, and s = 1 in Equation 9, we obtain N_i = 24; that is, the sliding window (6, 1) produces 24 DNA segments on a fragment of length 29.
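Equation 9 and the segment extraction can be sketched with two short helpers; the function names are illustrative, not from the authors' implementation.

```python
def num_segments(frag_len, w, s):
    """Number of windows of width w and step s along a fragment (Equation 9)."""
    return (frag_len - w) // s + 1

def sliding_windows(seq, w, s):
    """The subsequences actually produced by the (w, s) sliding window."""
    return [seq[i:i + w] for i in range(0, len(seq) - w + 1, s)]

# The worked example from the text: a fragment of length 29 with window (6, 1).
assert num_segments(29, 6, 1) == 24
```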

kmer

kmer is a simple and effective method to extract the information in a DNA sequence. By using kmer, the DNA sequence fragment F_i (cf. Equation 4) can be represented as

F_i = [f₁, f₂, …, f_{4^k}]^T

where f_j is the frequency of the j-th of the 4^k possible combinations of k neighboring nucleotides in the fragment F_i, and T represents the transpose operator. For example, Equation 10 is a 4-mer vector when k = 4.
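The kmer vector can be sketched as follows; this is a minimal illustration that returns the normalized frequencies of all 4^k k-mers in lexicographic ACGT order (the ordering is an assumption, and the function name is illustrative).

```python
from itertools import product

def kmer_vector(seq, k):
    """Normalized frequencies of all 4**k k-mers, in lexicographic ACGT order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    counts = [0] * len(kmers)
    total = len(seq) - k + 1                 # number of k-neighboring windows
    for i in range(total):
        counts[index[seq[i:i + k]]] += 1
    return [c / total for c in counts]
```

For k = 2 this produces a 16-dimensional vector per fragment; concatenating the vectors over all sliding-window segments yields the larger feature dimensions listed in Tables 2 and 3.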

PseKNC

The PseKNC incorporates the short-range sequence information, the long-range sequence information, and the physicochemical properties of the dinucleotides, formulating the DNA sequence fragment of Equation 4 as a feature vector that combines the k-tuple nucleotide frequencies with λ sequence-correlation factors. PseKNC has three parameters: k, λ (the number of sequence correlations considered), and w (the weight factor). Each of the parameters has been clearly defined in a paper and a comprehensive review. The kmer and PseKNC features can be easily generated by some existing tools, such as Pse-in-One and PseKNC-General.

Operation Engine

Support vector machines (SVMs) have been successfully applied in several bioinformatics problems (B.L., C. L., and K. Yan, unpublished data).20, 21, 22, 23, 24 In this study, we employed SVMs to build the predictor, using the SVM with radial basis function (RBF) kernel in the Scikit-learn package. The SVM has two parameters: C (regularization) and γ (kernel width). Accordingly, when combining the sliding-window approach and the SVM based on kmer or PseKNC, the parameters of the sliding window (w, s), the feature method (k for kmer; k, λ, and w for PseKNC), and the SVM (C, γ) need to be set jointly. By enumerating these parameter combinations, 30 elementary classifiers were developed with the kmer approach and 46 elementary classifiers were developed with the PseKNC approach. Therefore, we have a total of 30 + 46 = 76 elementary classifiers.
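The operation engine can be illustrated with a small Scikit-learn sketch. The feature matrix below is random toy data, not real promoter features, and the C and gamma values are placeholders rather than the optimized parameters of iPromoter-2L2.0.

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-ins for the feature vectors: 40 samples, 16 features each
# (random data), with labels 1 = promoter and 0 = non-promoter.
rng = np.random.default_rng(0)
X = rng.random((40, 16))
y = np.array([0, 1] * 20)

# RBF-kernel SVM as in the text; C and gamma are placeholder values.
clf = SVC(kernel="rbf", C=1.0, gamma="scale", probability=True)
clf.fit(X, y)
proba = clf.predict_proba(X)  # per-class probabilities, later used for ensembling
```

Enabling `probability=True` is what allows each elementary classifier to output the class probabilities needed for the distance measure and fusion described in the next section.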

Ensemble Learning

Inspired by previous studies,13, 26, 27, 28, 29, 30, 31, 32 an ensemble predictor with better prediction quality can be developed from a series of individual predictors by using a voting system. When developing an ensemble learning model, there are two fundamental issues: selecting individual classifiers with low correlation from the elementary classifiers, and constructing an ensemble classifier by fusing the selected classifiers. In this study, we employed the affinity propagation (AP) clustering algorithm to cluster the elementary classifiers based on the distances among them, and one key classifier was selected for each cluster. In order to measure the complementarity of different elementary classifiers, the distance between any two elementary classifiers C_i and C_j was measured by the following equation:

d(C_i, C_j) = (1/m) Σ_{k=1}^{m} |P_i(k) − P_j(k)|

where m is the training sample number and P_i(k) is the classification probability of classifier C_i on the k-th sample, which is calculated from the probabilities of predicting the k-th sample as each category y over the Y labels, where Y represents the number of labels. Y was set as 2 and 6 for promoter identification and promoter type prediction, respectively. By using Equations 18 and 19, the distance between any two elementary classifiers can be measured accurately. The range of d(C_i, C_j) is from 0 to 1, where 1 indicates that the predictive results of the two classifiers are completely complementary and 0 means that their results are identical. The elementary classifiers were then grouped into different clusters by using the AP clustering algorithm. The flowchart of the proposed iPromoter-2L2.0 predictor is shown in Figure 4.
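The clustering step can be sketched as below. This assumes the classifier distance is the mean absolute difference of per-sample probabilities (a sketch of Equation 18); the five "classifiers" are random toy probability vectors standing in for the outputs described by Equation 19.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Toy per-sample probabilities for 5 elementary classifiers on m = 100 samples.
rng = np.random.default_rng(1)
probs = rng.random((5, 100))

# Pairwise classifier distance: mean absolute difference of probabilities
# (0 = identical predictions, larger values = more complementary classifiers).
dist = np.mean(np.abs(probs[:, None, :] - probs[None, :, :]), axis=2)

# AP clustering works on similarities, so pass the negated distance matrix.
ap = AffinityPropagation(affinity="precomputed", random_state=0)
labels = ap.fit_predict(-dist)  # one cluster label per elementary classifier
```

One key classifier per cluster (e.g., the AP exemplar) would then be carried forward into the ensemble.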
Figure 4

A Flowchart Shows How iPromoter-2L2.0 Is Working

For the first layer, 10 key classifiers were obtained (Table 2), as formulated by Equation 20. For the second layer, nine key classifiers were obtained (Table 3), as formulated by Equation 21. By fusing the 10 key classifiers (cf. Equation 20), we can obtain the first-layer ensemble predictor (Equation 22), and by fusing the nine key classifiers (cf. Equation 21), we can obtain the second-layer ensemble predictor (Equation 23), where the fusion symbol in Equations 22 and 23 denotes a weighted linear combination of the key individual classifiers. The weight factors were optimized by the genetic algorithm, whose parameters (population size and number of evolutional generations) were set as 200 and 2,000, respectively, for both the first and second layers.
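The weighted fusion of the key classifiers can be sketched as follows; the weights below are illustrative placeholders, whereas in iPromoter-2L2.0 they were optimized by the genetic algorithm.

```python
import numpy as np

def fuse(probas, weights):
    """Weighted linear combination of key classifiers' class probabilities
    (a sketch of the fusion in Equations 22 and 23)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                        # normalize the weight factors
    # probas: (n_classifiers, n_samples, n_classes)
    return np.tensordot(w, np.asarray(probas), axes=1)

# Two toy key classifiers, 3 samples, 2 classes.
p1 = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
p2 = np.array([[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]])
fused = fuse([p1, p2], weights=[0.75, 0.25])
pred = fused.argmax(axis=1)  # final ensemble decision per sample
```

A genetic algorithm (population 200, 2,000 generations in the paper) would search over the weight vector to maximize cross-validated performance.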
Table 2

The 10 Key Classifiers for the First-Layer Prediction

Key Classifier   Feature Vector   Dimension
C1(1)            kmer^a           768
C1(2)            kmer^b           396
C1(3)            kmer^c           2,880
C1(4)            kmer^d           624
C1(5)            PseKNC^e         1,080
C1(6)            PseKNC^f         11,880
C1(7)            PseKNC^g         46,440
C1(8)            PseKNC^h         1,566
C1(9)            PseKNC^i         2,808
C1(10)           PseKNC^j         729

a The parameters used: k = 1.
b The parameters used: k = 1.
c The parameters used: k = 2.
d The parameters used: k = 1.
e The parameters used: k = 1, λ = 2, w = 0.5.
f The parameters used: k = 3, λ = 2, w = 0.5.
g The parameters used: k = 4, λ = 2, w = 0.5.
h The parameters used: k = 2, λ = 2, w = 0.5.
i The parameters used: k = 2, λ = 2, w = 0.5.
j The parameters used: k = 1, λ = 5, w = 0.5.

Table 3

The Nine Key Classifiers for the Second-Layer Prediction

Key Classifier   Feature Vector   Dimension
C2(1)            kmer^a           1,584
C2(2)            kmer^b           2,688
C2(3)            PseKNC^c         11,880
C2(4)            PseKNC^d         1,008
C2(5)            PseKNC^e         3,528
C2(6)            PseKNC^f         1,566
C2(7)            PseKNC^g         2,808
C2(8)            PseKNC^h         729
C2(9)            PseKNC^i         1,296

a The parameters used: k = 2.
b The parameters used: k = 2.
c The parameters used: k = 3, λ = 2, w = 0.5.
d The parameters used: k = 1, λ = 2, w = 0.5.
e The parameters used: k = 2, λ = 5, w = 0.5.
f The parameters used: k = 2, λ = 2, w = 0.5.
g The parameters used: k = 2, λ = 2, w = 0.5.
h The parameters used: k = 1, λ = 5, w = 0.5.
i The parameters used: k = 1, λ = 5, w = 0.5.


Cross-Validation and Performance Measures

The performance of the various predictors was evaluated by using 5-fold cross-validation with the following performance measures:

Sn(i) = TP_i / (TP_i + FN_i)
Sp(i) = TN_i / (TN_i + FP_i)
Acc(i) = (TP_i + TN_i) / (TP_i + FN_i + TN_i + FP_i)
MCC(i) = (TP_i·TN_i − FP_i·FN_i) / sqrt((TP_i + FP_i)(TP_i + FN_i)(TN_i + FP_i)(TN_i + FN_i))

where i = 1, 2, …, Y, and Y is the number of classes of this system; i denotes the i-th class or type, and TP_i, TN_i, FP_i, and FN_i are the numbers of true positives, true negatives, false positives, and false negatives for the i-th class, respectively. For the first-layer prediction, the value of Y is 2, and i represents the promoter (i = 1) or non-promoter (i = 2). Similarly, for the second-layer prediction, the value of Y is 6, and i is 1, 2, 3, 4, 5, or 6 for σ24, σ28, σ32, σ38, σ54, or σ70 promoters, respectively. For the details of these performance measures, please refer to a recent study.
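Under the standard one-vs-rest definitions, the four measures can be computed from confusion counts as follows; the counts in the usage example are illustrative, not taken from the paper.

```python
import math

def metrics(tp, fn, tn, fp):
    """Acc, MCC, Sn, Sp from one-vs-rest confusion counts (standard definitions)."""
    sn = tp / (tp + fn)                    # sensitivity
    sp = tn / (tn + fp)                    # specificity
    acc = (tp + tn) / (tp + fn + tn + fp)  # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc, sn, sp
```

For a multi-class layer, these would be computed once per class i, treating class i as positive and all other classes as negative.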

Author Contributions

B.L. provided the main idea of the manuscript and wrote the manuscript. K.L. did the experiments and wrote the manuscript.

Conflicts of Interest

The authors declare no competing interests.
References (26 in total)

1.  Sequence logos: a new way to display consensus sequences.

Authors:  T D Schneider; R M Stephens
Journal:  Nucleic Acids Res       Date:  1990-10-25       Impact factor: 16.971

2.  i6mA-Pred: identifying DNA N6-methyladenine sites in the rice genome.

Authors:  Wei Chen; Hao Lv; Fulei Nie; Hao Lin
Journal:  Bioinformatics       Date:  2019-08-15       Impact factor: 6.937

3.  PseKNC-General: a cross-platform package for generating various modes of pseudo nucleotide compositions.

Authors:  Wei Chen; Xitong Zhang; Jordan Brooker; Hao Lin; Liqing Zhang; Kuo-Chen Chou
Journal:  Bioinformatics       Date:  2014-09-16       Impact factor: 6.937

4.  iRSpot-EL: identify recombination spots with an ensemble learning approach.

Authors:  Bin Liu; Shanyi Wang; Ren Long; Kuo-Chen Chou
Journal:  Bioinformatics       Date:  2016-08-16       Impact factor: 6.937

5.  Improving tRNAscan-SE Annotation Results via Ensemble Classifiers.

Authors:  Quan Zou; Jiasheng Guo; Ying Ju; Meihong Wu; Xiangxiang Zeng; Zhiling Hong
Journal:  Mol Inform       Date:  2015-09-14       Impact factor: 3.353

6.  iDHS-EL: identifying DNase I hypersensitive sites by fusing three different modes of pseudo nucleotide composition into an ensemble learning framework.

Authors:  Bin Liu; Ren Long; Kuo-Chen Chou
Journal:  Bioinformatics       Date:  2016-04-08       Impact factor: 6.937

7.  iPromoter-2L: a two-layer predictor for identifying promoters and their types by multi-window-based PseKNC.

Authors:  Bin Liu; Fan Yang; De-Shuang Huang; Kuo-Chen Chou
Journal:  Bioinformatics       Date:  2018-01-01       Impact factor: 6.937

8.  PseKNC: a flexible web server for generating pseudo K-tuple nucleotide composition.

Authors:  Wei Chen; Tian-Yu Lei; Dian-Chuan Jin; Hao Lin; Kuo-Chen Chou
Journal:  Anal Biochem       Date:  2014-04-13       Impact factor: 3.365

9.  iHSP-PseRAAAC: Identifying the heat shock protein families using pseudo reduced amino acid alphabet composition.

Authors:  Peng-Mian Feng; Wei Chen; Hao Lin; Kuo-Chen Chou
Journal:  Anal Biochem       Date:  2013-06-10       Impact factor: 3.365

10.  70ProPred: a predictor for discovering sigma70 promoters based on combining multiple features.

Authors:  Wenying He; Cangzhi Jia; Yucong Duan; Quan Zou
Journal:  BMC Syst Biol       Date:  2018-04-24
