Thao T Tran, Phuongan Dam, Zhengchang Su, Farris L Poole, Michael W W Adams, G Tong Zhou, Ying Xu.
Abstract
Identification of operons in the hyperthermophilic archaeon Pyrococcus furiosus represents an important step to understanding the regulatory mechanisms that enable the organism to adapt and thrive in extreme environments. We have predicted operons in P.furiosus by combining the results from three existing algorithms using a neural network (NN). These algorithms use intergenic distances, phylogenetic profiles, functional categories and gene-order conservation in their operon prediction. Our method takes as inputs the confidence scores of the three programs, and outputs a prediction of whether adjacent genes on the same strand belong to the same operon. In addition, we have applied Gene Ontology (GO) and KEGG pathway information to improve the accuracy of our algorithm. The parameters of this NN predictor are trained on a subset of all experimentally verified operon gene pairs of Bacillus subtilis. It subsequently achieved 86.5% prediction accuracy when applied to a subset of gene pairs for Escherichia coli, which is substantially better than any of the three prediction programs. Using this new algorithm, we predicted 470 operons in the P.furiosus genome. Of these, 349 were validated using DNA microarray data.
Year: 2006 PMID: 17148478 PMCID: PMC1761436 DOI: 10.1093/nar/gkl974
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Schematic illustration of a one-layer NN architecture with three inputs from existing programs. The confidence values x of each operon prediction program are inputs into a neuron consisting of a summation unit and a transfer function, f, to produce an output a.
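The neuron described in this caption can be sketched in a few lines. This is only an illustration: the input scores, weights and bias below are hypothetical (the trained values are not given here), and logsig is taken to be the standard log-sigmoid, 1/(1 + e^(-n)).

```python
import math

def logsig(n: float) -> float:
    """Log-sigmoid transfer function: maps any real n into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-n))

def neuron_output(x, w, b):
    """One neuron: weighted sum of the inputs plus a bias, passed through logsig."""
    n = sum(wi * xi for wi, xi in zip(w, x)) + b  # summation unit
    return logsig(n)                              # transfer function f

# Hypothetical confidence scores from the three programs for one gene pair.
x = [0.80, 0.65, 0.90]
# Hypothetical weights and bias, chosen only for illustration.
w = [1.5, 1.0, 2.0]
b = -2.0

a = neuron_output(x, w, b)
print(f"output a = {a:.4f}")
```

The output a lies strictly between 0 and 1, so it can be compared against a fixed threshold to decide whether the gene pair is called an operon pair.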
Three-fold cross-validation results for E.coli and B.subtilis
| No. of inputs | 3-fold cross-validation inputs | AUROC | AUROC |
|---|---|---|---|
| — | JPOP | 0.8967 | 0.8568 |
| — | OFS | 0.9105 | 0.8381 |
| — | VIMSS | 0.9044 | 0.6207 |
| 3 | 3 only | 0.9225 | 0.8787 |
| 4 | 3 + GO | 0.9262 | 0.8797 |
| 4 | 3 + pathway | 0.9279 | 0.8798 |
| 4 | 3 + intergenic | 0.9170 | 0.8926 |
| 5 | 3 + GO + pathway | 0.8802 | |
| 5 | 3 + GO + intergenic | 0.9218 | 0.8955 |
| 5 | 3 + pathway + intergenic | 0.9267 | 0.8948 |
| 6 | 3 + GO + pathway + intergenic | 0.9275 | |
The area under the receiver operating characteristic curve (AUROC) is given for the three existing programs (JPOP, OFS, VIMSS) and for different sets of inputs into the NN, each evaluated by 3-fold cross-validation for each organism. ‘3 only’ denotes the use of only the confidence scores of the three existing programs. ‘3 + GO’ denotes the three existing programs plus the GO similarity score, for a total of 4 inputs into the NN. Combinations using the pathway score and the intergenic distance score are labeled in the same way. The NN was fixed to be a simple one-layer, one-neuron NN with transfer function f = logsig. Most of the improvement is realized by just combining the confidence scores from the three programs (3 only); however, there is further improvement from including other features such as the GO similarity score, KEGG pathway score and intergenic distance. The highest AUROC for each organism is in boldface.
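As a reminder of what the AUROC measures, the sketch below computes it directly from its rank interpretation: the probability that a randomly chosen operon pair receives a higher score than a randomly chosen non-operon pair, with ties counted as one half. The scores and labels are made up for illustration and are not taken from the study.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 = gene pair in the same operon, 0 = operon boundary.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical predictor outputs for six adjacent gene pairs.
scores = [0.91, 0.85, 0.40, 0.77, 0.30, 0.55]
labels = [1, 1, 1, 0, 0, 1]
print(auroc(scores, labels))  # 6 of the 8 positive/negative pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the AUROC values in the table should be read.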
Results of testing on E.coli after fixing network parameters and threshold from B.subtilis
| Program | Fixed threshold | TRAIN (B.subtilis) Sn | TRAIN (B.subtilis) Sp | TRAIN (B.subtilis) accuracy | TEST (E.coli) Sn | TEST (E.coli) Sp | TEST (E.coli) accuracy |
|---|---|---|---|---|---|---|---|
| JPOP | 0.3427 | 0.7962 | 0.8147 | 0.8049 | 0.8819 | 0.7433 | 0.8341 |
| OFS | 0.7494 | 0.8025 | 0.7788 | 0.7914 | 0.8819 | 0.7647 | 0.8415 |
| VIMSS | 0.674 | 0.5892 | 0.6187 | 0.6030 | 0.8453 | 0.8182 | 0.8359 |
| NN (3 + GO + pathway) [1] | 0.4164 | 0.8519 | 0.7788 | 0.8176 | 0.9241 | 0.7273 | 0.8562 |
| NN (3 + GO + pathway + intergenic) [1] | 0.4756 | 0.8662 | 0.8219 | 0.8454 | 0.8903 | 0.7861 | 0.8544 |
| NN (3 + GO + pathway + intergenic) [2,1] | 0.5876 | 0.8328 | 0.8651 | 0.8480 | 0.8847 | 0.8262 | 0.8645 |
The table presents the existing programs and various combinations of inputs into the NN predictor. The numbers in brackets [.] following each NN predictor indicate the number of neurons in each layer. For example, [1] represents a single-layer network with one neuron, and [2,1] represents a two-layer network with two neurons in the hidden layer and one neuron in the output layer. For each program the following are given: the fixed threshold from B.subtilis training, sensitivity (Sn), specificity (Sp) and accuracy. In the E.coli testing set, the two-layer [2,1] NN improves on each of the three existing programs in sensitivity, specificity and overall accuracy.
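For readers who want the three measures spelled out, here is a minimal sketch of how Sn, Sp and accuracy follow from a fixed threshold. It assumes the usual definitions, Sn = TP/(TP + FN) and Sp = TN/(TN + FP), and assumes that a pair is called an operon pair when its score is at or above the threshold. The threshold 0.5876 is the value listed for the [2,1] NN; the scores and labels are hypothetical.

```python
def confusion_metrics(scores, labels, threshold):
    """Return (sensitivity, specificity, accuracy) at a fixed threshold.

    Assumption: a gene pair is called 'same operon' when score >= threshold.
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative pair")
    sensitivity = tp / (tp + fn)   # fraction of true operon pairs recovered
    specificity = tn / (tn + fp)   # fraction of boundary pairs rejected
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

# Hypothetical scores and labels (1 = operon pair, 0 = boundary pair).
scores = [0.95, 0.70, 0.50, 0.62, 0.20, 0.35, 0.88, 0.59]
labels = [1, 1, 1, 0, 0, 0, 1, 0]
print(confusion_metrics(scores, labels, threshold=0.5876))
```

Because the threshold is fixed on the training organism and reused unchanged on the test organism, the test-set figures are not tuned to the test data.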
Figure 2. (A) ROC for the B.subtilis training set. (B) ROC for E.coli using trained parameters from B.subtilis. Each plot displays the ROC from the three existing programs {JPOP, OFS, VIMSS}, the performance of using only the GO similarity score {GO}, the performance of using only the pathway score {pathway}, the performance of using only the log-likelihood score of the intergenic distance {intergenic}, and the performance of the NN-based predictor incorporating all of the aforementioned (6) features {NN}. The numbers in the legend correspond to the points indicated by an asterisk (*) in the plot showing each program's threshold that maximizes the (Sensitivity + Specificity) value. For any threshold, the NN-based method has higher performance than any of the existing programs.
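The asterisked operating points can be understood as the result of a simple threshold scan. The sketch below is one plausible way to find the threshold that maximizes Sensitivity + Specificity; it tries every observed score as a candidate and is not necessarily the procedure used to produce the figure. The data are hypothetical.

```python
def best_threshold(scores, labels):
    """Return (threshold, sn, sp) maximizing sn + sp.

    Every distinct observed score is tried as a candidate threshold;
    a pair is called positive when score >= threshold (an assumption).
    On ties in sn + sp, the lowest such threshold is kept.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative pair")
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        sn, sp = tp / n_pos, tn / n_neg
        if best is None or sn + sp > best[1] + best[2]:
            best = (t, sn, sp)
    return best

# Hypothetical scores and labels (1 = operon pair, 0 = boundary pair).
scores = [0.10, 0.35, 0.40, 0.60, 0.65, 0.80, 0.90]
labels = [0, 0, 1, 0, 1, 1, 1]
threshold, sn, sp = best_threshold(scores, labels)
print(threshold, sn, sp)
```

Maximizing Sn + Sp weights the two error types equally regardless of class balance, which is why it marks a single point on each ROC curve.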
Results of testing on P.furiosus using fixed network parameters and threshold from B.subtilis
| Program | Fixed threshold | Known operons: Sn | Known operons: Sp | Known operons: accuracy | Microarray evidence list: Sn | Microarray evidence list: Sp | Microarray evidence list: accuracy |
|---|---|---|---|---|---|---|---|
| JPOP | 0.3427 | 0.8972 | 0.6129 | 0.8333 | 0.8198 | 0.5657 | 0.7453 |
| OFS | 0.7494 | 0.8505 | 0.7419 | 0.8261 | 0.5545 | 0.6936 | 0.5953 |
| VIMSS | 0.674 | 0.8972 | 0.7097 | 0.8551 | 0.7249 | 0.6970 | 0.7167 |
| NN | 0.5876 | 0.9907 | 0.5806 | 0.8986 | 0.9022 | 0.5354 | 0.7947 |
The results are from applying an optimal two-layer NN (a two-neuron hidden layer with a tansig transfer function and one output neuron with a logsig transfer function). The NN-based method presented uses inputs from the three existing programs together with GO, pathway and intergenic scores. The sensitivity (Sn), specificity (Sp) and accuracy are given for each program under each test set. Known operons is a limited set of 33 known/putative operons from literature. The microarray evidence list is described in Microarray data analysis.
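The two-layer architecture described here can be sketched as a forward pass over the six inputs. This is an illustrative sketch only: all weights, biases and input values below are hypothetical, tansig is taken to be the hyperbolic tangent, and logsig the standard log-sigmoid.

```python
import math

def tansig(n: float) -> float:
    """Hyperbolic tangent sigmoid: maps any real n into (-1, 1)."""
    return math.tanh(n)

def logsig(n: float) -> float:
    """Log-sigmoid: maps any real n into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-n))

def forward(x, W1, b1, w2, b2):
    """[2,1] network: two tansig hidden neurons feeding one logsig output neuron."""
    hidden = [
        tansig(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    return logsig(sum(w * h for w, h in zip(w2, hidden)) + b2)

# Hypothetical inputs: three program scores, then GO, pathway and intergenic scores.
x = [0.80, 0.65, 0.90, 0.70, 1.00, 0.40]
# Hypothetical parameters, chosen only for illustration.
W1 = [[0.5, 0.4, 0.6, 0.3, 0.2, 0.1],
      [-0.3, 0.2, 0.1, 0.4, -0.2, 0.5]]
b1 = [-0.5, 0.1]
w2 = [2.0, 1.0]
b2 = -0.5

p = forward(x, W1, b1, w2, b2)
print(f"network output = {p:.4f}")
```

The hidden layer lets the network combine the inputs nonlinearly before the final logsig squashing, which a single neuron cannot do.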
Characteristics of operons predicted by the NN-based method for each organism
| Organism | No. of ORFs | No. of operons | Average operon size | % Gene coverage |
|---|---|---|---|---|
| | 2490 | 806 | 3.0893 | 59 |
| | 2288 | 747 | 3.0629 | 56 |
| P.furiosus | 1460 | 470 | 3.1064 | 69 |
For each organism, the number of open reading frames (ORFs) included in the operon prediction, the number of operons, the average operon size and the percent of gene coverage (=100 × no. of ORFs included in the operon prediction/total no. of ORFs in the organism) are given.
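The table's derived columns can be checked with a little arithmetic. The sketch below recomputes the average operon size as ORFs divided by operons, and inverts the coverage formula to estimate the total ORF count each row implies. The implied totals are only approximate, because the coverage percentages are rounded to whole numbers, and they are not figures stated in the text.

```python
# (ORFs in predicted operons, predicted operons, reported % gene coverage),
# listed in the same row order as the table.
rows = [(2490, 806, 59), (2288, 747, 56), (1460, 470, 69)]

def average_operon_size(n_orfs: int, n_operons: int) -> float:
    """Mean number of ORFs per predicted operon."""
    return n_orfs / n_operons

def implied_total_orfs(n_orfs: int, coverage_pct: float) -> float:
    """Invert coverage = 100 * n_orfs / total.

    Approximate only: the reported coverage is rounded to a whole percent.
    """
    return 100.0 * n_orfs / coverage_pct

for n_orfs, n_operons, coverage in rows:
    size = round(average_operon_size(n_orfs, n_operons), 4)
    total = round(implied_total_orfs(n_orfs, coverage))
    print(f"avg size = {size}, implied total ORFs ~ {total}")
```

The recomputed averages agree with the table to four decimal places, so the ORF and operon counts are internally consistent.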
Figure 3. Venn diagram of overlap between gene pairs for operons predicted from the NN-based method, the ‘microarray evidence list’ and the ‘putative operon list’. Predicted operons from the NN-based method overlapping the ‘microarray evidence list’ and the ‘putative operon list’ represent strong candidates for further experimental studies.