Rafael Vieira Coelho, Scheila de Avila E Silva, Sergio Echeverrigaray, Ana Paula Longaray Delamare.
Abstract
This paper presents the prediction of Bacillus subtilis promoters using a Support Vector Machine (SVM) system. The literature contains less information on the promoter sequences of Gram-positive bacteria than on those of Gram-negative bacteria. Promoter sequence identification is essential for studying gene expression. We first collected the B. subtilis genome sequence from the NCBI database and identified promoters by their sigma factors in the DBTBS database. We then grouped the promoters according to 15 sigma factors in 2 domains, corresponding to sigma 54 and sigma 70 of Gram-negative bacteria. Based on these data, we developed a Python script to search for promoters in the B. subtilis genome. After processing the data, we obtained 767 promoter sequences for B. subtilis, most of which are recognized by the sigma factor SigA. To validate these data, we developed a software package called BacSVM+, which receives promoters as input and returns the best combination of LibSVM parameters for predicting promoter regions in the bacterium used in the simulation. All data gathered, as well as the BacSVM+ software, are available for download at http://bacpp.bioinfoucs.com/rafael/Sigmas.zip.
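The search for "the best combination of parameters" can be pictured as a grid search over cost and gamma. The sketch below is an illustration under stated assumptions, not the BacSVM+ code: it enumerates powers of two, a common LibSVM convention that is consistent with the values reported in the SVM results table (for example 0.0625 = 2^-4 and 1.52587890625E-5 = 2^-16). The exponent ranges, the step of 4, and the `evaluate` callback are hypothetical.

```python
from itertools import product


def power_of_two_grid(min_exp, max_exp, step):
    """Return [2**min_exp, 2**(min_exp + step), ...] up to 2**max_exp."""
    return [2.0 ** e for e in range(min_exp, max_exp + 1, step)]


def grid_search(evaluate, costs, gammas):
    """Try every (cost, gamma) pair and return (best_score, cost, gamma).

    `evaluate` is a hypothetical callback that trains and scores a model
    for one parameter pair; it stands in for a real LibSVM run.
    """
    best = None
    for cost, gamma in product(costs, gammas):
        score = evaluate(cost, gamma)
        if best is None or score > best[0]:
            best = (score, cost, gamma)
    return best


# Assumed ranges, chosen only so the grid contains the values in the results table.
costs = power_of_two_grid(-8, 16, 4)    # 0.00390625 ... 65536.0
gammas = power_of_two_grid(-16, 0, 4)   # 2**-16 ... 1.0


def toy_score(cost, gamma):
    """Toy scoring function used only to exercise the search loop."""
    return -abs(cost - 16.0) - abs(gamma - 0.0625)


best = grid_search(toy_score, costs, gammas)
```

In a real run, `evaluate` would train an SVM with the given pair and return its validation accuracy; the exhaustive loop is simple but its cost grows with the product of the two grid sizes.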
Keywords: Bacillus subtilis; Promoter sequences; SVM
Year: 2018 PMID: 29892645 PMCID: PMC5993011 DOI: 10.1016/j.dib.2018.05.025
Source DB: PubMed Journal: Data Brief ISSN: 2352-3409
Sigma factors of B. subtilis [16].
| Domain | Sigma factor | Description | Number of operons |
|---|---|---|---|
| sigma54 | SigL | RNA polymerase sigma-54 (Sigma L). | 6 |
| sigma70 | SigA | RNA polymerase major sigma-43 (Sigma A). Essential gene. | 358 |
| sigma70 | SigB | RNA polymerase sigma-37 (Sigma B). General stress sigma factor. | 67 |
| sigma70 | SigD | RNA polymerase sigma-28 (Sigma D). Autolytic enzymes; defect in flagellar synthesis. | 30 |
| sigma70 | SigE | RNA polymerase sporulation-specific sigma-29. Processed by SpoIIGA after Tyr-27. | 83 |
| sigma70 | SigF | Synthesized shortly after the onset of sporulation but does not become active until after polar division. | 30 |
| sigma70 | SigG | Control of transcription in the forespore at late stages of sporulation. | 61 |
| sigma70 | SigH | RNA polymerase sigma-30. Non-essential sigma factor involved in expression of vegetative and early stationary-phase genes. | 24 |
| sigma70 | SigI | Temperature-sensitive growth in a null mutant; transcription induced by heat shock in rich medium but not in minimal medium; reduced amount of GsiB protein in a sigI mutant under heat shock conditions. | 1 |
| sigma70 | SigK | Formed by a site-specific recombination event that joins the previously separated spoIVCB and spoIIIC genes into a single cistron. | 59 |
| sigma70 | SigM | Essential for growth and survival in high concentrations of salt; expression maximal during exponential growth and increased in high concentrations of salt; activity negatively regulated by YhdL and YhdK. | 7 |
| sigma70 | SigW | ECF-type sigma factor that mediates the transcriptional response to cell wall stress. | 34 |
| sigma70 | SigX | RNA polymerase SigX. | 15 |
| sigma70 | SigY | RNA polymerase ECF (extracytoplasmic function)-type sigma factor. | 2 |
| sigma70 | YlaC | RNA polymerase ECF (extracytoplasmic function)-type sigma factor. | 1 |
Fig. 1. Number of operons per sigma factor of B. subtilis. The X-axis shows the sigma factors. The Y-axis shows the number of operons.
List of SigL operons [14].
Fig. 2. Example of promoter sequence selection from the acuABC operon of SigA in the B. subtilis genome.
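Locating a known promoter sequence in a genome can be sketched as an exact string search on both strands. This is a minimal illustration, not the authors' script: the function names, the exact-match strategy, and the toy sequences are assumptions made for the example.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_promoter(genome, promoter):
    """Return sorted (start, strand) pairs for every exact match.

    `start` is a 0-based index on the forward strand of `genome`.
    A palindromic promoter is reported once per strand.
    """
    genome = genome.upper()
    promoter = promoter.upper()
    hits = []
    for strand, pattern in (("+", promoter), ("-", reverse_complement(promoter))):
        pos = genome.find(pattern)
        while pos != -1:
            hits.append((pos, strand))
            pos = genome.find(pattern, pos + 1)  # allow overlapping matches
    return sorted(hits)


# Toy data (hypothetical): one forward match and one reverse-strand match.
toy_genome = "GGTTGACAGGCCTGTCAAGG"
hits = find_promoter(toy_genome, "TTGACA")
```

Exact matching is the simplest choice; a real pipeline would more likely work from annotated coordinates or tolerate mismatches, which this sketch does not attempt.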
Configuration Parameters of BacSVM+.
| Parameter | Description |
|---|---|
| gamma (G) | set gamma in kernel function (default is 1/num_features) |
| cost (C) | set the parameter C; used only in C-SVC, epsilon-SVR, and nu-SVR (default is 1) |
| svm type | C_SVC (default), NU_SVC, ONE_CLASS, EPSILON_SVR and NU_SVR |
| kernel type | set type of kernel function |
| coef0 | set coefficient zero in kernel function (default 0) |
| degree | set degree in kernel function (default 3) |
| nu | set the parameter nu; used only in nu-SVC, one-class SVM, and nu-SVR (default 0.5) |
| cache size | cache memory size in MB (default 100) |
| epsilon | tolerance of termination criterion (default 0.001) |
| shrinking | whether to use the shrinking heuristics |
| probability | whether to train an SVC or SVR model for probability estimates |
| weight | set the parameter C of class i to weight*C, for C-SVC (default 1) |
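These parameters correspond to the command-line options of LibSVM's `svm-train` (`-s`, `-t`, `-d`, `-g`, `-r`, `-c`, `-n`, `-m`, `-e`, `-h`, `-b`). The sketch below shows one way to assemble such an option string in Python. The helper `libsvm_options` and its argument names are hypothetical and are not part of BacSVM+ or LibSVM; only the flags and the numeric codes for SVM and kernel types follow LibSVM's documented conventions.

```python
# Numeric codes used by LibSVM for the -s and -t options.
SVM_TYPES = {"C_SVC": 0, "NU_SVC": 1, "ONE_CLASS": 2, "EPSILON_SVR": 3, "NU_SVR": 4}
KERNEL_TYPES = {"LINEAR": 0, "POLY": 1, "RBF": 2, "SIGMOID": 3}


def libsvm_options(svm_type="C_SVC", kernel="RBF", cost=1.0, gamma=None,
                   degree=3, coef0=0.0, nu=0.5, cache_mb=100,
                   epsilon=0.001, shrinking=True, probability=False):
    """Build an svm-train style option string (hypothetical helper)."""
    parts = [
        f"-s {SVM_TYPES[svm_type]}",
        f"-t {KERNEL_TYPES[kernel]}",
        f"-d {degree}",
        f"-r {coef0}",
        f"-c {cost}",
        f"-n {nu}",
        f"-m {cache_mb}",
        f"-e {epsilon}",
        f"-h {int(shrinking)}",
        f"-b {int(probability)}",
    ]
    # Omit -g so that LibSVM falls back to its default of 1/num_features.
    if gamma is not None:
        parts.append(f"-g {gamma}")
    return " ".join(parts)


# Example: a nu-SVC with an RBF kernel, cost 256 and gamma 2**-12.
opts = libsvm_options("NU_SVC", "RBF", cost=256.0, gamma=2.0 ** -12)
```

The per-class `weight` option (`-wi`) is left out to keep the sketch short.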
SVM results.
| SVM type | Kernel | Cost (C) | Gamma (G) | Accuracy (A, %) | Specificity (S, %) | Sensitivity (SN, %) |
|---|---|---|---|---|---|---|
| C-SVC | SIGMOID | 0.0625 | 1.52587890625E-5 | 82.04 | 94.17 | 69.90 |
| C-SVC | SIGMOID | 1.0 | 0.00390625 | 85.44 | 86.41 | 84.47 |
| C-SVC | SIGMOID | 16.0 | 2.44140625E-4 | 87.86 | 88.35 | 87.38 |
| C-SVC | RBF | 1.0 | 2.44140625E-4 | 82.04 | 94.17 | 69.90 |
| C-SVC | RBF | 0.0625 | 0.00390625 | 86.41 | 98.06 | 74.76 |
| C-SVC | RBF | 16.0 | 2.44140625E-4 | 91.26 | 92.23 | 90.29 |
| C-SVC | LINEAR | 0.00390625 | 1.52587890625E-5 | 87.86 | 88.35 | 87.38 |
| NU-SVC | SIGMOID | 16.0 | 0.0625 | 57.28 | 54.37 | 60.19 |
| NU-SVC | SIGMOID | 1.0 | 0.00390625 | 93.20 | 94.17 | 92.23 |
| NU-SVC | RBF | 256.0 | 2.44140625E-4 | 95.63 | 96.12 | 95.15 |
| ONE-CLASS | SIGMOID | 1.0 | 0.00390625 | 23.79 | 0.0 | 32.67 |
| ONE-CLASS | SIGMOID | 1.0 | 1.52587890625E-5 | 24.27 | 0.0 | 32.47 |
| ONE-CLASS | SIGMOID | 0.0625 | 1.0 | 48.54 | 0.0 | 96.15 |
| ONE-CLASS | RBF | 16.0 | 0.0625 | 20.87 | 0.0 | 26.54 |
| ONE-CLASS | RBF | 16.0 | 1.52587890625E-5 | 21.84 | 0.0 | 30.41 |
| ONE-CLASS | RBF | 65536.0 | 0.00390625 | 24.76 | 0.0 | 34.46 |
* Cost (C), Gamma (G), Accuracy (A), Specificity (S) and Sensitivity (SN).
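The three measures in the table can be computed from a confusion matrix as follows. This is a generic sketch using the standard definitions of accuracy, specificity, and sensitivity (the SN column). The counts in the example are hypothetical: they assume a balanced test set of 103 promoters and 103 non-promoters, chosen only because they reproduce the percentages of the first table row; the test set size is not stated in this section.

```python
def _percent(numerator, denominator):
    """Return numerator/denominator as a percentage, or 0.0 if undefined."""
    return 100.0 * numerator / denominator if denominator else 0.0


def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, specificity, sensitivity) in percent.

    tp/fn count promoters predicted correctly/incorrectly;
    tn/fp count non-promoters predicted correctly/incorrectly.
    """
    accuracy = _percent(tp + tn, tp + tn + fp + fn)
    specificity = _percent(tn, tn + fp)   # true negative rate
    sensitivity = _percent(tp, tp + fn)   # true positive rate
    return accuracy, specificity, sensitivity


# Hypothetical counts: 103 promoters (72 found) and 103 non-promoters (97 rejected).
a, s, sn = classification_metrics(tp=72, tn=97, fp=6, fn=31)
print(f"A={a:.2f}  S={s:.2f}  SN={sn:.2f}")  # A=82.04  S=94.17  SN=69.90
```

When both classes have the same number of test samples, accuracy equals the mean of specificity and sensitivity, which is the pattern seen in the C-SVC and NU-SVC rows.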
| Subject area | biology |
| More specific subject area | promoter sequences |
| Type of data | text file |
| How data was acquired | script developed in Python |
| Data format | Raw |
| Experimental factors | not applicable |
| Experimental features | We collected the genome and promoter sequences recognized by |
| Data source location | not applicable |
| Data accessibility | |
| Related research article | Silva et al. |