
GPMiner: an integrated system for mining combinatorial cis-regulatory elements in mammalian gene group.

Tzong-Yi Lee¹, Wen-Chi Chang, Justin Bo-Kai Hsu, Tzu-Hao Chang, Dray-Ming Shien.

Abstract

BACKGROUND: Sequence features in promoter regions are involved in regulating gene transcription initiation. Although numerous computational methods have been developed for predicting transcriptional start sites (TSSs) or transcription factor (TF) binding sites (TFBSs), they do not annotate some important regulatory features such as CpG islands, tandem repeats, the TATA box, CCAAT box, GC box, over-represented oligonucleotides, DNA stability, and GC content. Additionally, the combinatorial interaction of TFs regulates groups of genes that share the same expression pattern. To investigate gene transcriptional regulation, an integrated system that annotates regulatory features in a promoter sequence and detects co-regulation of TFs in a group of genes is needed.
RESULTS: This work identifies TSSs and regulatory features in a promoter sequence, and recognizes co-occurrence of cis-regulatory elements in co-expressed genes using a novel system. Three well-known TSS prediction tools are incorporated with orthologous conserved features, such as CpG islands, nucleotide composition, over-represented hexamer nucleotides, and DNA stability, to construct the novel Gene Promoter Miner (GPMiner) using a support vector machine (SVM). According to five-fold cross-validation results, the predictive sensitivity and specificity are both roughly 80%. The proposed system allows users to input a group of gene names/symbols, enabling the co-occurrence of TFBSs to be determined. Additionally, an input sequence can also be analyzed for homogeneity of experimental mammalian promoter sequences, and conserved regulatory features between homologous promoters can be observed through cross-species analysis. After identifying promoter regions, regulatory features are visualized graphically to facilitate gene promoter observations.
CONCLUSIONS: The GPMiner, which has a user-friendly input/output interface, has numerous benefits in analyzing human and mouse promoters. The proposed system is freely available at http://GPMiner.mbc.nctu.edu.tw/.

Substances：
Transcription Factors

Year: 2012 PMID： 22369687 PMCID： PMC3587379 DOI： 10.1186/1471-2164-13-S1-S3

Source DB: PubMed Journal: BMC Genomics ISSN： 1471-2164 Impact factor: 3.969

Background

Gene transcription is regulated by transcription factors (TFs) that bind specifically to promoter regions, which are the crucial control regions for transcriptional activation of all genes [1]. A typical promoter sequence, which is located near the transcriptional start site (TSS), is believed to comprise short DNA sequences known as regulatory elements, including TF binding sites (TFBSs) [2]. With the vast amount of available genomic data, an increasing need exists for techniques that can rapidly and accurately evaluate sequences for the presence of promoters [3]. Furthermore, some important regulatory motifs, such as the TATA box, CCAAT box, GC box, and INR box, must be annotated in promoter sequences. In addition, the presence of CpG islands close to a TSS, the statistical properties that distinguish proximal and core promoters from other genomic sequences, orthologous gene promoters, and the restriction of a promoter region using information from mRNA transcripts must be considered [4]. Additionally, some co-regulatory networks describe the set of all significant associations among TFs in regulating common target genes [5]. Accordingly, the combinatorial interaction of TFs is critical in gene regulation. PlantPAN, a database-assisted system for recognizing co-occurrence of cis-regulatory elements in plant co-expressed genes [6], is effective for plant promoter investigations. However, no similar resource exists for identifying co-occurring TFBSs in a group of mammalian promoters. Veerla et al. recently developed SMART software for identifying co-occurring TFBSs in gene set promoters [7]. Nevertheless, this software does not have a user-friendly interface for identifying TSSs with regulatory elements and efficiently analyzing combinatorial TFBSs of a group of promoters. COXPRESdb provides coexpressed gene networks and coexpressed gene lists ordered based on the strength of coexpression for humans and mice [8].
However, COXPRESdb does not analyze TFBSs in co-expressed gene promoters. Although TOUCAN is a Java application for identifying significant cis-regulatory elements from sets of co-expressed genes, TOUCAN ignores combinatorial TFBSs analysis [9]. This work develops a novel system, Gene Promoter Miner (GPMiner), for identifying co-occurring TFBSs in a group of gene promoters. However, the promoter region must be precisely identified before identification of TFBSs co-occurrence. Many databases are useful in collecting numerous TSSs and have promoter prediction tools. The DBTSS is a TSS database established by gathering experimentally identified promoter regions via the oligo-capping method [10]. The Eukaryotic Promoter Database (EPD) is an annotated non-redundant collection of eukaryotic POL II promoters, for which the TSS has been determined experimentally [11]. Various promoter prediction methods have been developed for analyzing gene promoter regions (Table S1, additional file 1). The CpGProD program identifies CpG islands in mammalian promoter regions [12]. The DragonGSF program predicts gene promoters based on information of CpG islands, TSSs and downstream signals of predicted TSSs [13]. The NNPP2.2 program applies a time-delay neural network for promoter annotation of the Drosophila melanogaster genome [14]. The Eponine detects the transcriptional initiation site near the TATA box, together with flanking regions of GC enrichment [15]. To identify TSSs, McPromoter, a statistical method, identifies the eukaryotic polymerase II TSS in genomic DNA [16-18]. The FirstEF uses a set of discriminant functions that can recognize both boundaries of the first exon [19]. The PromoSer method computationally identifies TSSs by considering the alignments of numerous partial and full-length mRNA sequences to those of genomic DNA [20]. The PromH scheme identifies promoters based on conservation of regulatory features in pairs of human/mouse orthologous genes. 
Another regulatory feature of promoter regions, DNA stability, was investigated for analyzing prokaryotic promoters [21]. Notably, DNA stability is a structural property of the DNA duplex fragment. The minimum free energy of the DNA duplex is calculated based on hydrogen bonding of A-T and C-G pairs. Kanhere et al. demonstrated that DNA stability of promoter regions provides a much better clue than other features when determining the location of the TSS [21]. Although numerous computational methods have been developed for identifying promoters of genes in genomic sequences, their outcomes are not satisfactory, especially for promoters lacking a TATA box and CpG islands [1]. Furthermore, many methods have poor predictive specificity, generating many false-positive predictions, or have poor sensitivity. Therefore, this work develops an integrated system, GPMiner, that identifies promoter regions with high predictive sensitivity and specificity. Moreover, GPMiner comprehensively annotates regulatory elements, including TFBSs, CpG islands, tandem repeats, the presence of a TATA box, CCAAT box, or GC box, statistically over-represented sequence patterns, GC content (GC%), and DNA stability. Additionally, GPMiner accurately identifies combinatorial TFBSs in a group of gene promoters.

Construction and content

Figure 1 presents the GPMiner system flow, which identifies promoter regions and annotates transcriptional regulatory features in a user-input genomic sequence. Computational models for promoter identification were constructed by incorporating the support vector machine (SVM) with nucleotide composition features, over-represented hexamer nucleotides, and DNA stability. Additionally, GPMiner allows users to input a group of genes for identification of co-occurring TFBSs in promoter sequences. All mined promoter regions and regulatory features in the user-input sequence are visualized graphically to facilitate analysis of gene transcriptional regulation. The details of the proposed method are as follows.

Figure 1

System flow of GPMiner.

Input genomic sequence

Users first input a genomic sequence in the FASTA format to identify promoter regions and to mine regulatory elements within the input sequence. The input sequence is used to search for homology with experimental mammalian promoter sequences collected from the DBTSS (version 6.0) [10], EPD (release 80) [11] and Ensembl (version 61) [22]. All experimentally verified TSSs are located using the genomic positional information provided by DBTSS and EPD. By default, the region from 2000 bp upstream to 200 bp downstream of the TSS (+1) is defined as the promoter region and extracted for a sequence homology search. Notably, GPMiner collects 22774, 25420, 22159, 22475, and 18201 known genes from the human, mouse, rat, chimpanzee, and dog genomes, respectively. After the sequence homology search, the proposed system outputs a set of known genes with promoter sequences resembling the input sequence. Additionally, users can input the chromosomal location to specify sequence regions for mining regulatory features.
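
The default promoter window described above can be sketched as follows. This is a minimal illustration, not GPMiner's code: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and the window is assumed to contain the 2000 bases immediately upstream of the TSS plus 200 bases beginning at the TSS itself (position +1).

```python
# Hypothetical helpers; GPMiner's actual extraction code is not shown in the paper.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def promoter_window(tss, strand, upstream=2000, downstream=200):
    """Return 1-based inclusive (start, end) genomic coordinates of the window.

    Assumption: the TSS is position +1, so `downstream` counts the TSS base.
    """
    if strand == "+":
        return max(1, tss - upstream), tss + downstream - 1
    if strand == "-":
        # On the reverse strand, "upstream" lies at higher genomic coordinates.
        return max(1, tss - downstream + 1), tss + upstream
    raise ValueError("strand must be '+' or '-'")


def extract_promoter(chrom_seq, tss, strand, upstream=2000, downstream=200):
    """Slice the window from a chromosome string, oriented 5'->3' for the gene."""
    start, end = promoter_window(tss, strand, upstream, downstream)
    end = min(end, len(chrom_seq))
    region = chrom_seq[start - 1:end]
    if strand == "-":
        region = region.translate(_COMPLEMENT)[::-1]  # reverse complement
    return region
```

For a forward-strand TSS at position 5000, `promoter_window(5000, "+")` covers positions 3000 to 5199, that is, 2200 bases.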

Promoter identification

The GPMiner system uses an SVM that considers orthologously conserved regulatory features, such as CpG islands, nucleotide composition, over-represented hexamer nucleotides, and DNA stability, of a promoter sequence to identify mammalian proximal promoters (Figure 2). Mammalian promoters are usually around 1000 bp long [23]. Because some regulatory elements are located far from the TSS, numerous cis-regulatory element annotation systems use 3000 bp upstream as the maximum region for analysis [24]. Furthermore, several studies indicate that the region downstream of the TSS plays critical roles during transcription. Therefore, 3000 bp downstream of the TSS are also selected for analysis. Consequently, experimentally identified promoters originating from human and mouse genomes collected from the DBTSS (Table S2, additional file 1) were mapped to Ensembl genomic positions, and flanking sequences of -3000 bp to +3000 bp around the mapped TSSs were selected. Homologous promoter sequences between human and mouse genomes were then analyzed using the BLAST program [25]. Homologous promoter sequences with a sequence identity exceeding 80% were extracted and defined as training sequences. These training sequences were classified into two subgroups based on whether CpG islands were detected by CpGProD [12]. Table S3 (in additional file 1) lists the statistics of the classified training set.

Figure 2

Analytical flowchart of promoter identification.

After constructing and classifying the training set, training sequences are first analyzed for their nucleotide composition by calculating the occurrence rate of mono-, di-, and tri-mer nucleotides within a 20-bp window sliding along the training sequences. Figure S1 (in additional file 1) lists average distributions of occurrence rates of nucleotide compositions. Pearson's correlation coefficient is calculated for clustering average distributions of mono-, di-, and tri-mer nucleotides into two groups based on the two major distributions of adenine and guanine (Table S4, additional file 1). Furthermore, training sequences are also used to extract over-represented 6-mer nucleotides within a specified window size around the TSSs, which comprise the so-called positive set. The occurrence probabilities of 6-mer nucleotides in the specified window are calculated and compared to background probabilities of the entire genome. After evaluating different numbers (50-200) of over-represented 6-mer nucleotides, the top 100 over-represented 6-mer nucleotides are selected as training features. Furthermore, DNA stability is a feature used for identifying promoter sequences. SantaLucia et al. [26] used the unified standard free energy of ten dinucleotide duplexes (AA/TT, AT/TA, TA/AT, CA/GT, GT/CA, CT/GA, GA/CT, CG/GC, GC/CG, and GG/CC; Table S5, additional file 1) to calculate the standard free energy change of a DNA oligonucleotide based on dinucleotide composition. This work applied the equation of standard free energy change to determine the stability of a DNA duplex with a window size of 15 nt sliding from -3000 to +3000 relative to the TSSs in training sequences. Figure S2 (in additional file 1) shows distributions of average free energy of DNA duplex formation. Near the TSS, a peak exists in the region from -30 to -10, which corresponds to the TATA box in eukaryotic promoter sequences.
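
The sliding-window stability calculation can be illustrated with a short sketch. The nearest-neighbor values below are the unified ΔG°37 parameters (kcal/mol) as commonly quoted from SantaLucia [26]; they should be checked against Table S5 before use. The function names are hypothetical, and the sketch sums dinucleotide terms only, omitting the initiation and symmetry corrections, which is an assumption about how the window energy is computed.

```python
# Unified nearest-neighbor free energies (kcal/mol at 37 C), keyed by the
# 5'->3' dinucleotide on one strand; each duplex step and its reverse
# complement share one value. Verify against Table S5 before relying on them.
NN_DG = {
    "AA": -1.00, "TT": -1.00,
    "AT": -0.88,
    "TA": -0.58,
    "CA": -1.45, "TG": -1.45,
    "GT": -1.44, "AC": -1.44,
    "CT": -1.28, "AG": -1.28,
    "GA": -1.30, "TC": -1.30,
    "CG": -2.17,
    "GC": -2.24,
    "GG": -1.84, "CC": -1.84,
}


def window_free_energy(seq):
    """Sum nearest-neighbor terms over all overlapping dinucleotides in seq.

    Raises KeyError for characters other than A, C, G, T.
    """
    seq = seq.upper()
    return sum(NN_DG[seq[i:i + 2]] for i in range(len(seq) - 1))


def stability_profile(seq, window=15):
    """Free energy of every window of the given size, sliding by 1 nt."""
    seq = seq.upper()
    return [
        window_free_energy(seq[i:i + window])
        for i in range(len(seq) - window + 1)
    ]
```

Less negative values mark less stable (more easily melted) stretches, which is the kind of signal expected around an AT-rich TATA box.
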
A public SVM library, LIBSVM [27], is used to construct predictive models. The SVM kernel function is set to the radial basis function (RBF). Before using extracted regulatory features to train SVM models, the specified window sizes of proximal promoter regions, which comprise the so-called positive set, must be defined. Therefore, five window sizes (-60 to +20, -100 to +50, -200 to +100, -300 to +150, and -400 to +200) are defined, and a benchmark is applied to evaluate the predictive performance of proximal promoter regions. The benchmark, namely, cross-validation, extracts equal sizes from the positive set and negative set, constructs the SVM model, and evaluates the model with k-fold cross-validation. Training sequences within the specified window are defined as the positive set; regions other than those in specified windows, with window sizes equal to those in the positive set, are chosen randomly as the negative set. Predictive performance of the constructed models is evaluated by five-fold cross-validation [28]. Training data are divided into five groups by splitting each dataset into five approximately equally sized subgroups. During cross-validation, each subgroup is used as the validation set in turn, and the remaining subgroups comprise the training set. The measures of predictive performance of trained models are Precision (Prec) = TP/(TP+FP), Sensitivity (Sn) = TP/(TP+FN), Specificity (Sp) = TN/(TN+FP), and Accuracy (Acc) = (TP+TN)/(TP+FP+TN+FN), where TP, TN, FP, and FN are the numbers of true positive, true negative, false positive, and false negative predictions, respectively. The SVM models constructed from the three different regulatory features are evaluated, and models with the best predictive accuracy are selected for the mammalian proximal promoter prediction.
Moreover, several promoter prediction tools, NNPP2.2 [14], Eponine [15] and McPromoter [16], are integrated into GPMiner to provide additional information about the proximal promoter, thereby improving predictive specificity.
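
The four performance measures and the five-fold partition can be written out directly. The sketch below is illustrative only: the function names are hypothetical, the split is deterministic (no shuffling), and model training with LIBSVM is left out.

```python
def performance(tp, tn, fp, fn):
    """Precision, sensitivity, specificity and accuracy from confusion counts.

    Assumes every denominator is non-zero.
    """
    return {
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


def k_fold_splits(n_items, k=5):
    """Yield (training_indices, validation_indices) for each of k folds.

    Items are dealt round-robin into k approximately equal subgroups; each
    subgroup serves once as the validation set.
    """
    indices = list(range(n_items))
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, folds[i]
```

In practice the positive and negative examples would be shuffled before splitting so that each fold contains a similar class mix.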

Mining cis-regulatory features

After identifying proximal promoter regions, regulatory elements involved in gene transcriptional regulation, such as transcription factor binding sites, CpG islands, the TATA box, CCAAT box, GC box, and over-represented sequences, are annotated. Furthermore, tandem repeats, DNA stability, and GC content in the promoter region are provided for advanced analysis of gene transcriptional regulation. Table 1 shows the databases and tools integrated into GPMiner for mining regulatory elements within input sequences. For instance, MATCH [29] was utilized for scanning TFBSs in an input sequence using the TF binding profiles from TRANSFAC public release version 7.0 [30] and JASPAR [31]. The CpGProD program [12] was applied to detect CpG islands in a promoter region with a prediction specificity of roughly 70%. Tandem Repeats Finder [32] was applied to identify tandem repeats in promoter sequences. In detecting TFBSs in promoter regions, cutoff values of core and matrix scores of the MATCH program are set to 1.0 and 0.7, respectively. In particular, frequent regulatory elements, such as the TATA box, CCAAT box, and GC box, are represented separately.

Table 1

Supported regulatory features in GPMiner

Regulatory features	Integrated database or tools	Descriptions
Transcriptional start site	NNPP2.2 [14]	Applying a time-delay neural network for promoter annotation

	McPromoter [16]	Using a statistical method to identify eukaryotic polymerase II TSS in genomic DNA

	Eponine [15]	Predicting the transcription start site for a DNA sequence with prediction specificity > 70%

Transcription factor (TF) binding site	TRANSFAC public release 7.0 [46]	Storing the experimentally verified transcription factors, their genomic binding sites and DNA-binding profiles

	MATCH [29]	Scanning the transcription factor binding site using the transcription factor binding profiles from TRANSFAC public release 7.0 and JASPAR

CpG island	CpGProD [12]	Detecting the CpG island

Repeats	TRF [32]	A tandem repeat finder

TATA box, CCAAT box, and GC box	MATCH [29]	Scanning the TATA-, CCAAT- and GC-box by the transcription factor binding profiles from TRANSFAC

	Narang et al. [47]	Defining the 6-mer pattern of the TATA box, CCAAT box, and GC box with positional density

Over-represented pattern	Huang et al. [48]	Defining the statistically significant pattern in the promoter region

DNA stability	Aditi Kanhere et al. [21]	Predicting the DNA stability of the promoter region

Co-occurrence of TF binding sites	apriori [35]	A method to mine the association rules

Conserved regions between homologous gene promoter sequences	Blast [25]	Using the blast program to analyze the conserved region between the homologous gene promoter sequences

Several important regulatory features, such as repeats and over-represented oligonucleotides, are integrated. Repeats, such as tandem repeats, Alu, and L1 elements, can alter the methylation distribution in a genome, and possibly gene transcription [33,34]. The proposed system applies a statistical method to identify over-represented (OR) oligonucleotides (6-12 bp) in promoter regions; these over-represented oligonucleotides are identified by comparing their occurrence frequencies in promoter regions with their background occurrence frequencies throughout the whole genome (see additional file 1 for a detailed description). Based on statistical significance, this work defines an oligonucleotide with a Z-score > 5 as an OR sequence. Moreover, DNA stability distributions are provided. The GC contents are also calculated using a window size of 15 nt and used as references for identification of CpG islands.
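
The over-representation test can be sketched as follows. The exact statistic is given only in additional file 1, so this example assumes a simple binomial model: a k-mer with background frequency p is expected N·p times among N promoter k-mer positions, with standard deviation sqrt(N·p·(1-p)). The function names are hypothetical, and k-mers absent from the background are skipped rather than scored.

```python
from collections import Counter
from math import sqrt


def count_kmers(sequences, k=6):
    """Count overlapping k-mers made only of A, C, G, T."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def overrepresented_kmers(promoters, background, k=6, z_cutoff=5.0):
    """Return {kmer: z} for k-mers whose Z-score exceeds z_cutoff.

    Assumed model: binomial counts with the background frequency as the
    success probability (not necessarily the statistic GPMiner uses).
    """
    fg = count_kmers(promoters, k)
    bg = count_kmers(background, k)
    n_fg, n_bg = sum(fg.values()), sum(bg.values())
    hits = {}
    if n_fg == 0 or n_bg == 0:
        return hits
    for kmer, observed in fg.items():
        p = bg.get(kmer, 0) / n_bg
        if p <= 0.0 or p >= 1.0:
            continue  # no usable background estimate for this k-mer
        expected = n_fg * p
        z = (observed - expected) / sqrt(n_fg * p * (1.0 - p))
        if z > z_cutoff:
            hits[kmer] = z
    return hits
```

Longer oligonucleotides (up to 12 bp) can be handled by passing a different `k`, although sparse background counts make the binomial approximation less reliable as k grows.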

Identifying co-occurrence of TFBSs in a group of gene promoters

GPMiner allows users to input a group of genes to mine co-occurrence of TFBSs in promoter regions. An association rule mining method, namely, Apriori [35], is applied to mine the co-occurrence of TFBSs in a group of gene promoter sequences. Consider a large database of transactions, in which each transaction consists of a set of items. An association rule is an expression, such as A ⇒ B, where A and B are item sets. The rule states that a transaction in the database that contains A tends to also contain B. For example, 90% of people who purchase beer also purchase diapers. Herein, 90% is the rule confidence. The support of the A ⇒ B rule is the percentage of transactions containing both A and B. The formal problem statement is as follows. Let S = {s1, s2, ..., sm} be a set of known TFBSs of the human genome. The union of members in the set S is called the item set I. Let G = {g1, g2, ..., gn} be a group of genes with differential expression in a specific tissue. Each promoter region of a gene is mapped to a transaction containing a set of known regulatory sites, also called items. A promoter region P contains A, a set of items of I, when A ⊆ P. An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅. The A ⇒ B rule holds in the set of promoter regions D with confidence c when c% of the promoter regions in D that contain A also contain B. The A ⇒ B rule has support s in the set of promoter regions D when s% of promoter regions in D contain A ∪ B. The association rules, the so-called co-occurrences of TFBSs, are generated when a rule has higher support and confidence than those specified by a user.
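
Support and confidence can be made concrete with a small sketch in which each promoter is a set of TFBS names. The function names and the TF labels (`TF_A`, `TF_B`, `TF_C`) are hypothetical, and the brute-force pair enumeration stands in for Apriori's candidate pruning; it is not the implementation used by GPMiner.

```python
from itertools import permutations


def support(transactions, itemset):
    """Fraction of promoters (transactions) that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among promoters containing the antecedent, the fraction that also
    contain the consequent."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    n_a = sum(1 for t in transactions if a <= t)
    if n_a == 0:
        return 0.0
    return sum(1 for t in transactions if both <= t) / n_a


def pair_rules(transactions, min_support, min_confidence):
    """List single-item rules (a, b, support, confidence) passing both cutoffs."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(items, 2):
        sup = support(transactions, {a, b})
        conf = confidence(transactions, {a}, {b})
        if sup >= min_support and conf >= min_confidence:
            rules.append((a, b, sup, conf))
    return rules


# Illustrative promoters only; these are not results from the paper.
promoters = [
    {"TF_A", "TF_B"},
    {"TF_A", "TF_B", "TF_C"},
    {"TF_A", "TF_B"},
    {"TF_A", "TF_C"},
]
rules = pair_rules(promoters, min_support=0.7, min_confidence=0.9)
```

With these illustrative sets, `TF_B ⇒ TF_A` passes both thresholds while `TF_A ⇒ TF_B` does not, which shows that confidence is directional although support is symmetric.
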
After mining co-occurrences (combinations) of TFBSs in a group of gene promoter sequences, the statistical significance of each combination must be examined against the background set of genes using the hypergeometric model, where K is the number of background gene promoters used, T is the number of observed gene promoters input by users, k is the number of promoters that have the combination in the background gene set, and t is the number of promoters that have the combination in the observed gene set. The P-value is calculated for each combination based on the hypergeometric equation; the lower the P-value, the higher the statistical significance.
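
Only the model and its four parameters are named in this text, so the following sketch assumes the conventional upper-tail hypergeometric P-value: the probability of seeing t or more promoters carrying the combination when T promoters are drawn without replacement from K background promoters, k of which carry it. The function name is hypothetical, and the observed genes are assumed to be a subset of the background.

```python
from math import comb


def hypergeom_pvalue(K, k, T, t):
    """P(X >= t) for X ~ Hypergeometric(population K, successes k, draws T).

    K: background promoters; k: background promoters with the combination;
    T: observed promoters; t: observed promoters with the combination.
    Assumes 0 <= k <= K and 0 <= t <= T <= K.
    """
    upper = min(k, T)
    total = comb(K, T)
    tail = sum(comb(k, i) * comb(K - k, T - i) for i in range(t, upper + 1))
    return tail / total
```

For example, if 4 of 10 background promoters carry a combination and all 4 observed promoters carry it, the tail contains a single term and the P-value is 1/210.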

Graphical visualization

After mining proximal promoter regions and regulatory features, all mined regulatory features are presented graphically in the web interface, which is constructed using the GD library and PHP programming language. To simplify graphical visualization, regulatory features with numerous entries are presented initially in an overview form. Regulatory features are displayed in detail when users click the "detailed view" button. Additionally, detailed information of regulatory features is listed in tabular form. The co-occurrences of TFBSs in a set of gene promoter sequences are also represented graphically. When users investigate promoters of known genes, conserved regions of homologous gene promoters are displayed graphically, as are regulatory features found in conserved promoter regions. The graphical visualization of regulatory elements facilitates analysis of gene transcription regulation.

Utilities and discussion

Performance of promoter identification

A benchmark, namely, cross-validation, is used to evaluate the predictive performance of GPMiner, which incorporates an SVM with nucleotide composition, over-represented hexamer nucleotides, and DNA stability for mammalian proximal promoter identification. The benchmark is used to extract equal sizes of the positive set and negative set, construct the SVM model, and evaluate the model with k-fold cross-validation (k = 5). Table S6 (in additional file 1) lists the prediction performance of the constructed SVM models trained with three different regulatory features based on the five window sizes. When training sequences are classified into two subgroups (with and without CpG islands), the predictive performance of the group with CpG islands is markedly higher than that of the group without CpG islands; furthermore, as window size increases, the prediction performance of SVM models increases. However, after considering both prediction performance and window size, a window size of -200 to +100 is selected as the specified window for identifying proximal promoter regions. Vertebrate gene expression is frequently regulated by the proximal promoter, which is traditionally defined as between -200 bp and the TSS [36]. Table 2 lists the predictive performance of SVM models trained with combinations of the three different regulatory features: over-represented hexamer nucleotides (OR), nucleotide composition (NC), and DNA stability (DS). Three training sets ("all", with CpG islands, and without CpG islands) are evaluated by benchmark cross-validation based on the specified window size of -200 to +100 relative to the TSS (+1). In all three training sets, the combination OR+NC+DS achieves the highest accuracy. Moreover, the training set with CpG islands, which achieves a predictive accuracy of 82%, performs better than the "all" and without-CpG-islands training sets.
Both SVM models trained with the training sets with CpG islands and without CpG islands are used for proximal promoter identification. Whether an input sequence contains a CpG island is then detected, and the sequence is then predicted by the SVM model with CpG islands or the SVM model without CpG islands.

Table 2

Training set	Window size	Features	Precision	Sensitivity	Specificity	Accuracy
All(6,452)	-200 ~+100	OR+NC	77%	71%	79%	75%
	-200 ~+100	OR+DS	76%	69%	78%	74%
	-200 ~+100	NC+DS	75%	74%	76%	75%
	-200 ~+100	OR+NC+DS	79%	76%	79%	78%

With CpG(4,898)	-200 ~+100	OR+NC	79%	81%	79%	80%
	-200 ~+100	OR+DS	77%	80%	76%	78%
	-200 ~+100	NC+DS	77%	82%	75%	78%
	-200 ~+100	OR+NC+DS	80%	84%	79%	82%

Without CpG (1,554)	-200 ~+100	OR+NC	68%	70%	67%	68%
	-200 ~+100	OR+DS	68%	71%	66%	68%
	-200 ~+100	NC+DS	66%	67%	66%	66%
	-200 ~+100	OR+NC+DS	69%	69%	71%	70%

The number of training sequences used to construct the SVM models is shown in parenthesis of the column "Training set".

The prediction performance of SVM models with combinations of three kinds of regulatory features, namely over-represented hexamer nucleotides (OR), nucleotide composition (NC), and DNA stability (DS), is evaluated by benchmark "Cross-validation" based on the specified window size -200 to +100 of TSS(+1).

Notably, GPMiner lets users input a novel sequence to annotate the proximal promoter region with the putative TSS. Thus, 1871 human promoter sequences (from -3000 to +3000) in the EPD comprise the independent test set used to evaluate predictive performance. The regions of the test sequences within -200 to +100 relative to the TSSs (+1) are defined as the positive set; the negative set is extracted randomly from regions other than those in the positive set. Table S7 (in additional file 1) compares the predictive performance of GPMiner with those of NNPP2.2, Eponine, and McPromoter. Furthermore, Figure S3 (in additional file 1) shows the distribution of promoter predictions of GPMiner, NNPP2.2, Eponine, and McPromoter. The sensitivity of GPMiner is better than that of the other methods; however, the predictive specificities of McPromoter and Eponine are better than that of GPMiner. To achieve high specificity, NNPP2.2, Eponine, and McPromoter are integrated to reduce the number of false-positive predictions.

Web interface

The GPMiner system has two primary functions. First, "gene group analysis" is adopted to identify co-occurrence of TFBSs in a group of gene promoters. Combinatorial regulation by TF complexes is an important feature of eukaryotic gene regulation [5,37,38]. Second, "promoter analysis" can be employed to analyze TFBSs, CpG islands, tandem repeats, the presence of a TATA box, CCAAT box, or GC box, statistically over-represented sequence patterns, GC content (GC%) and DNA stability in the promoter sequence of a given gene ID or a novel promoter sequence. Furthermore, cross-species analysis of homologous gene promoters is performed by GPMiner, such that conserved regulatory features in promoter regions can also be observed. Figure 3 shows the web interfaces of GPMiner. In the submission interface, users first choose one of five mammals, such as human, mouse, rat, chimpanzee or dog, and input a genomic sequence or chromosomal location for identifying proximal promoter regions and for mining regulatory features. Eight regulatory features currently exist in GPMiner. By default, all regulatory features are chosen for annotation in the input sequence. Notably, users can input a chromosome location to specify regions of interest for retrieving genes located in this chromosome region. During the mining process, the proposed system uses various tools individually to annotate different regulatory features in an input sequence. Each annotating tool for regulatory features has some search parameters, such as score threshold in NNPP2.2, Eponine, and McPromoter, the core score and matrix score for the MATCH program, Z-score for over-represented oligonucleotides, and support and confidence scores for co-occurrence TFBSs analysis, in a gene group search. Default parameters for these tools are set and the related documentation is shown on the help webpage. 
After mining regulatory features, a graphical visualization of identified promoter regions and mined regulatory features is provided to users. Figures S4 and S5 (see additional file 1) present graphical representations of regulatory elements for known gene promoter and homologous promoter sequences, respectively.

Figure 3

The submission and result interface of GPMiner.

Case studies

Figure 4 shows an example gene group analysis. Notably, NF-kappaB is a well-known inducible TF that controls kinetically complex patterns of gene expression in multiple pathways in humans. In a previous study, ATM, EP300, FGFB1, and SFN were regulated by NF-kappaB and co-regulated by the Ets TF in the progression of various cancers [39]. To demonstrate GPMiner, these four gene names were input for gene group analysis to detect co-occurring TFBSs. The thresholds of the core score and matrix score values in TFBS scanning were 1.0 and 0.9, respectively, and the support and confidence values in co-occurrence analysis were both set at 90%. Notably, NF-kappaB and Ets are identified as combinatorial TFs in these four gene promoters after three analytical steps by GPMiner. This result is consistent with known regulatory pathways [39]. Therefore, GPMiner can identify co-occurring TFBSs in a set of gene promoters. The proposed system can be applied to analyze co-regulation in microarray gene-expression databases such as COXPRESdb [40] and Genevestigator [41]. The proposed GPMiner system improves our understanding of transcription regulatory networks of gene regulation in mammals.

Figure 4

Gene group analysis in GPMiner.

Moreover, to demonstrate the application of single promoter analysis, a case study involving a human gene is described below. The v-fos FBJ murine osteosarcoma viral oncogene homolog gene (gene symbol FOS) is a regulator of cell proliferation, differentiation, and transformation [42]. According to experimentally verified annotation in the Entrez Gene database, the FOS gene is regulated by numerous transcription factors such as SP1, SRF, SAP-1, and AP-1. Additionally, the FOS gene exhibits DNA methylation based on information in the Gene Ontology database. The FOS gene promoter sequence was extracted and input into GPMiner to mine the proximal promoter region and annotate regulatory elements. The DNA stability of the input sequence is graphically represented and the proximal promoter region is highlighted (Figure S2, additional file 1). Using the TSS prediction tool Eponine, potential TSSs are located near positions 500 and 2000 bp. The CpG islands were annotated, as were numerous TFs that may regulate the FOS gene promoter, including SP1, SRF, SAP-1, and AP-1. Moreover, the TATA box was annotated near position 2000 bp. In summary, the annotated regulatory features indicate that the proximal promoter region is located near position 2000 bp, which is consistent with the experimentally validated TSS of the FOS gene at position 2001 bp.

Conclusions

The GPMiner system has a gene group analysis function for analyzing the co-occurrence of TFBSs with statistical measures in a set of co-expressed genes. This function provides a practical platform to examine co-expressed genes from microarray data in transcriptional regulation networks. Furthermore, the GPMiner system has a user-friendly input/output interface, and has numerous advantages in mammalian promoter analysis. The proposed system incorporates an SVM with nucleotide composition, over-represented hexamer nucleotides, and DNA stability for mammalian proximal promoter identification and mines regulatory elements, including TSSs, TFBSs, CpG islands, tandem repeats, the TATA box, CCAAT box, GC box, statistically over-represented sequence patterns, GC content (GC%) and DNA stability. Evaluated by benchmark cross-validation, the predictive sensitivity and specificity of GPMiner are roughly 80%. All mined promoter regions and regulatory features in the user input sequence are graphically visualized to facilitate gene transcription analysis. Table 3 compares the functions of several representative programs for promoter annotation with those of GPMiner.

Table 3

Comparison of GPMiner with several representative gene promoter annotation programs

Transcriptional regulatory features	PromoSer [20]	PromH [49]	DragonGSF [13]	McPromoter [16]	GPMiner
Species supported	Human, mouse, and rat	Human and mouse	Mammalian	Eukaryote	Human, mouse, rat, chimp, and dog
Promoter identification	Yes	Yes	Yes	Yes	Yes
Map to known gene promoters	Yes	-	-	-	DBTSS, EPD and Ensembl
Transcription factor binding site	-	Yes	Yes	-	TRANSFAC public release and JASPAR, MATCH
TATA-box	-	Yes	-	Yes	Yes
Tandem repeat	Yes	-	-	Yes	Tandem Repeat Finder
CpG island	-	-	Yes	-	CpGProD
Over-represented pattern	-	-	-	-	Yes
DNA stability	-	-	-	-	Yes
GC content	-	-	Yes	-	Yes
Co-occurrence of TFBSs	-	Yes	-	-	Yes
Graphical view	Yes	-	-	Yes	Yes
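The co-occurrence of TFBSs in a gene group, listed as a feature in Table 3, can be sketched in a few lines of Python. This section does not specify which statistical measure GPMiner applies, so the one-sided hypergeometric test below is an assumption chosen for illustration, and the function names and toy gene-to-TF assignments are hypothetical.

```python
from itertools import combinations
from math import comb


def hypergeom_tail(k, n_genes, n_a, n_b):
    """P(X >= k): chance that at least k genes carry both sites A and B
    when n_b genes are drawn at random from n_genes, n_a of which carry A."""
    total = comb(n_genes, n_b)
    upper = min(n_a, n_b)
    hits = sum(
        comb(n_a, i) * comb(n_genes - n_a, n_b - i)
        for i in range(k, upper + 1)
    )
    return hits / total


def cooccurring_pairs(sites_by_gene):
    """Return (tf_a, tf_b, genes_with_both, p_value) tuples, smallest p first.

    sites_by_gene maps a gene symbol to the set of TFs with a predicted
    binding site in that gene's promoter.
    """
    n_genes = len(sites_by_gene)
    genes_with = {}
    for gene, tfs in sites_by_gene.items():
        for tf in tfs:
            genes_with.setdefault(tf, set()).add(gene)

    results = []
    for a, b in combinations(sorted(genes_with), 2):
        both = len(genes_with[a] & genes_with[b])
        p = hypergeom_tail(both, n_genes, len(genes_with[a]), len(genes_with[b]))
        results.append((a, b, both, p))
    return sorted(results, key=lambda r: r[3])


if __name__ == "__main__":
    # Hypothetical annotations for four placeholder genes.
    toy = {
        "gene1": {"SP1", "SRF"},
        "gene2": {"SP1", "SRF"},
        "gene3": {"AP-1"},
        "gene4": {"AP-1"},
    }
    for row in cooccurring_pairs(toy):
        print(row)
```

This test only asks whether two TFBSs appear together within the group more often than their individual frequencies in that group would suggest. Assessing over-representation relative to a genomic background would require a separate reference set of promoters.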

The Functional Annotation of the Mouse 3 (FANTOM3) project [43] provides comprehensive, experimentally identified TSSs of the human and mouse genomes obtained by cap analysis of gene expression (CAGE) [44]. These comprehensive CAGE-derived TSSs may be used to analyze promoters in advance. In addition to DNA stability, several structural properties of the DNA duplex in the promoter region, such as DNA curvature and bendability [45], should be analyzed and applied to identify gene promoter regions in mammals. Future versions of GPMiner will include detailed information about gene regulation, such as microarray gene-expression profiles. The GPMiner system will be maintained and updated continuously.

Availability

The GPMiner web server will be continuously maintained and updated. The web server is now freely available at http://GPMiner.mbc.nctu.edu.tw/.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

TYL and WCC conceived and supervised the project. TYL, JBKH, and DMS were responsible for the design and computational analyses, implemented the web-based tool, and drafted the manuscript, with revisions provided by WCC and THC. All authors read and approved the final manuscript.

Additional file 1

Additional figures and tables. Contains additional figures and tables showing further results in the study.

47 in total

1. Interpolated markov chains for eukaryotic promoter recognition.

Authors: U Ohler; S Harbeck; H Niemann; E Nöth; M G Reese
Journal: Bioinformatics Date: 1999-05 Impact factor: 6.937

2. GENEVESTIGATOR. Arabidopsis microarray database and analysis toolbox.

Authors: Philip Zimmermann; Matthias Hirsch-Hoffmann; Lars Hennig; Wilhelm Gruissem
Journal: Plant Physiol Date: 2004-09 Impact factor: 8.340

3. Basic local alignment search tool.

Authors: S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal: J Mol Biol Date: 1990-10-05 Impact factor: 5.469

4. Promoter prediction analysis on the whole human genome.

Authors: Vladimir B Bajic; Sin Lam Tan; Yutaka Suzuki; Sumio Sugano
Journal: Nat Biotechnol Date: 2004-11 Impact factor: 54.908

5. Tandem repeats finder: a program to analyze DNA sequences.

Authors: G Benson
Journal: Nucleic Acids Res Date: 1999-01-15 Impact factor: 16.971

6. TRANSFAC database as a bridge between sequence data libraries and biological function.

Authors: E Wingender; H Karas; R Knüppel
Journal: Pac Symp Biocomput Date: 1997

7. A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics.

Authors: J SantaLucia
Journal: Proc Natl Acad Sci U S A Date: 1998-02-17 Impact factor: 11.205

8. Role of DNA 5-methylcytosine transferase in cell transformation by fos.

Authors: A V Bakin; T Curran
Journal: Science Date: 1999-01-15 Impact factor: 47.728

9. Transcriptional disruption by the L1 retrotransposon and implications for mammalian transcriptomes.

Authors: Jeffrey S Han; Suzanne T Szak; Jef D Boeke
Journal: Nature Date: 2004-05-20 Impact factor: 49.962

10. Identifying combinatorial regulation of transcription factors and binding motifs.

Authors: Mamoru Kato; Naoya Hata; Nilanjana Banerjee; Bruce Futcher; Michael Q Zhang
Journal: Genome Biol Date: 2004-07-28 Impact factor: 13.583
