Ruben Chevez-Guardado, Lourdes Peña-Castillo.
Abstract
Promoters are genomic regions where the transcription machinery binds to initiate the transcription of specific genes. Computational tools for identifying bacterial promoters have been around for decades. However, most of these tools were designed to recognize promoters in one or a few bacterial species. Here, we present Promotech, a machine-learning-based method for promoter recognition in a wide range of bacterial species. We compare Promotech's performance with that of five other promoter prediction methods. Promotech outperforms these programs in terms of area under the precision-recall curve (AUPRC) or precision at the same level of recall. Promotech is available at https://github.com/BioinformaticsLabAtMUN/PromoTech.
Keywords: Bacterial promoter; Bioinformatics; Machine learning; Microbiology; Promoter prediction; Promoter recognition
Year: 2021 PMID: 34789306 PMCID: PMC8597233 DOI: 10.1186/s13059-021-02514-9
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Flowchart illustrating our methodology
Summary of data sets used
| Bacterium | Genome accession | PubMed ID | NGS technology | #TSS | Genome length | T or V |
|---|---|---|---|---|---|---|
| | NC_000913.3 | 27748404 | dRNA-seq | 278 | 4,641,652 | T |
| | NC_000913.2 | 25266388 | dRNA-seq | 2672 | 4,639,675 | T |
| | NC_000915.1 | 20164839 | dRNA-seq | 1907 | 1,667,867 | T |
| | NC_000915.1 | 30169674 | dRNA-seq | 449 | 1,667,867 | T |
| | NC_009839.1 | 30169674 | dRNA-seq | 269 | 1,628,115 | T |
| | NC_002163.1 | 23696746 | dRNA-seq | 1905 | 1,641,481 | T |
| | NC_003912.7 | 23696746 | dRNA-seq | 2167 | 1,777,831 | T |
| | NC_009839.1 | 23696746 | dRNA-seq | 1944 | 1,628,115 | T |
| | NC_008787.1 | 23696746 | dRNA-seq | 2003 | 1,616,554 | T |
| | LR031521.1 | 30902048 | dRNA-seq | 892 | 1,877,450 | T |
| | NC_016810.1 | 22538806 | dRNA-seq | 1873 | 4,878,012 | T |
| | NC_000922.1 | 21989159 | dRNA-seq | 530 | 1,230,230 | T |
| | NC_004347.2 | 24987095 | dRNA-seq | 4729 | 4,969,811 | T |
| | NZ_LT962963.1 | 28154810 | dRNA-seq | 2865 | 4,614,703 | T |
| | NC_003888.3 | 27251447 | dRNA-seq | 3570 | 8,667,507 | T |
| | NC_008596.1 | 30984135 | dRNA-seq | 4054 | 6,988,209 | V |
| | NC_010001.1 | 27982035 | Cappable-seq | 1187 | 4,847,594 | V |
| | NC_014034.1 | – | dRNA-seq | 5374 | 3,738,958 | V |
| | CP002927.1 | 26133043 | dRNA-seq | 1064 | 3,939,203 | V |
In the last column, T or V indicates whether the bacterium is reserved for training or validation, respectively. The table also lists the number of TSSs per bacterium, the genome length, the next-generation sequencing (NGS) technology used to obtain the TSSs, and the PubMed ID of the source publication (a missing PubMed ID means that the source manuscript was still in preparation at the time of this publication).
Training data set’s characteristics
| Bacterial species | Phylum | GC content (%) |
|---|---|---|
| | Actinobacteria | 71.98 |
| | Chlamydiae | 40.6 |
| | Firmicutes | 38.4 |
| | Proteobacteria | 52.1 |
| | Proteobacteria | 50.6 |
| | Proteobacteria | 46 |
| | Proteobacteria | 38.9 |
| | Proteobacteria | 30.4 |
| | Spirochaetes | 35 |
Validation data set’s characteristics
| Bacterial species | Phylum | GC content (%) |
|---|---|---|
| | Actinobacteria | 67.4 |
| | Firmicutes | 46.4 |
| | Firmicutes | 35.6 |
| | Proteobacteria | 66.5 |
The AUPRC and AUROC obtained in 25% of the training data set left out for testing
| Models | AUPRC | AUROC |
|---|---|---|
| RF-HOT | ||
| RF-TETRA | 0.593 | 0.844 |
| GRU-0 | 0.752 | 0.929 |
| GRU-1 | ||
| GRU-2 | 0.753 | 0.929 |
| GRU-3 | 0.728 | 0.922 |
| GRU-4 | 0.728 | 0.923 |
| LSTM-0 | 0.734 | 0.923 |
| LSTM-1 | 0.744 | 0.927 |
| LSTM-2 | 0.739 | 0.923 |
| LSTM-3 | | 0.924 |
| LSTM-4 | | |
The data set used has a 1:10 ratio of positive to negative instances. The numbers in bold indicate the models with the highest AUPRC/AUROC per machine learning method
Impurity-based feature importance ranking generated by the RF-TETRA model
| Ranking | Tetra-nucleotide | Score |
|---|---|---|
| 1 | TATA | 0.023 ± 0.015 |
| 2 | ATAA | 0.014 ± 0.009 |
| 3 | TAAT | 0.014 ± 0.008 |
| 4 | TTAT | 0.011 ± 0.007 |
| 5 | AAAA | 0.010 ± 0.001 |
| 6 | TTTT | 0.010 ± 0.001 |
| 7 | GTTA | 0.009 ± 0.004 |
| 8 | TATT | 0.009 ± 0.004 |
| 9 | TAAA | 0.009 ± 0.002 |
| 10 | AATA | 0.008 ± 0.004 |
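The tetra-nucleotide features ranked above can be illustrated with a small sketch. It assumes that each sequence is summarized by the relative frequency of all 256 possible tetra-nucleotides, counted over overlapping windows; the function name, the normalization, and the handling of ambiguous bases are illustrative assumptions rather than details confirmed by this section.

```python
from itertools import product

# All 4^4 = 256 tetra-nucleotides in a fixed (lexicographic) order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(seq):
    """Return a 256-long vector of tetra-nucleotide relative frequencies.

    Assumption: overlapping windows, and windows containing characters
    other than A, C, G, T (for example N) are skipped.
    """
    seq = seq.upper()
    counts = dict.fromkeys(TETRAMERS, 0)
    total = 0
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    return [counts[k] / total if total else 0.0 for k in TETRAMERS]


# "TATAAT" contains three overlapping windows: TATA, ATAA, TAAT.
features = tetranucleotide_frequencies("TATAAT")
```

A vector like `features` would be one row of the input matrix for a random forest; the ranking in the table then refers to columns of that matrix.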
Permutation-based feature importance ranking generated by the RF-TETRA model
| Ranking | Tetra-nucleotide | Score |
|---|---|---|
| 1 | ATAA | 0.052 ± 0.001 |
| 2 | TATA | 0.048 ± 0.001 |
| 3 | TAAT | 0.046 ± 0.002 |
| 4 | TTAT | 0.039 ± 0.001 |
| 5 | GTTA | 0.036 ± 0.001 |
| 6 | TAAA | 0.035 ± 0.001 |
| 7 | AATA | 0.035 ± 0.001 |
| 8 | ATTA | 0.033 ± 0.001 |
| 9 | TATT | 0.031 ± 0.001 |
| 10 | AATT | 0.031 ± 0.000 |
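Permutation-based importance, as used for the ranking above, measures how much a model's score drops when the values of one feature are shuffled across instances. The following is a generic sketch of that idea and not the authors' code: the score (accuracy), the toy predictor, and all names are assumptions made for illustration.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which predict(row) matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the data with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: feature 0 determines the label, feature 1 is ignored by the model.
X = [[0.1, 5.0], [0.2, 1.0], [0.3, 7.0], [0.4, 2.0],
     [0.6, 3.0], [0.7, 8.0], [0.8, 4.0], [0.9, 6.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]


def toy_predict(row):
    return int(row[0] > 0.5)


importances = permutation_importance(toy_predict, X, y)
```

Because the toy predictor never reads feature 1, shuffling that column cannot change the score, so its importance is exactly zero; an informative feature can only lose score when shuffled.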
Fig. 2 Impurity-based feature importance scores per nucleotide per position relative to the TSS as calculated from the RF-HOT model
Fig. 3 Permutation-based feature importance scores per nucleotide per position relative to the TSS as calculated from the RF-HOT model
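Importance scores "per nucleotide per position" presuppose a one-hot encoding in which every sequence position contributes one binary feature per nucleotide. A minimal sketch of such an encoding follows; the A, C, G, T channel order, the all-zero encoding of ambiguous bases, and the helper names are assumptions, and the sequence length and TSS offset used in the paper are not stated in this section.

```python
ALPHABET = "ACGT"


def one_hot_encode(seq):
    """Flatten a DNA sequence into 4 binary features per position.

    Assumption: channel order A, C, G, T; any other character
    (for example N) becomes four zeros.
    """
    encoded = []
    for base in seq.upper():
        encoded.extend(1 if base == letter else 0 for letter in ALPHABET)
    return encoded


def feature_label(index):
    """Map a flat feature index back to (position, nucleotide)."""
    position, channel = divmod(index, len(ALPHABET))
    return position, ALPHABET[channel]


vector = one_hot_encode("TATA")
```

With this layout, the importance of flat feature `i` can be reported for the position and nucleotide given by `feature_label(i)`, which is the kind of grid the two figures display.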
Average AUPRC and AUROC ± standard deviation obtained per model across the validation set when requiring that predicted promoters have at least 10% sequence overlap with the actual promoters to be considered true positives
| Model | Mean AUPRC | Mean AUROC |
|---|---|---|
| RF-HOT | ||
| GRU-0 | 0.026 ± 0.016 | 0.687 ± 0.063 |
| GRU-1 | 0.025 ± 0.017 | 0.675 ± 0.048 |
| LSTM-3 | 0.034 ± 0.017 | 0.677 ± 0.019 |
| LSTM-4 | 0.033 ± 0.023 | 0.631 ± 0.040 |
The numbers in bold indicate the model with the highest performance
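The 10% overlap criterion in the caption above can be sketched as an interval test. This is only an illustration under stated assumptions: intervals are half-open `(start, end)` coordinates on the same strand, and the overlap is expressed as a fraction of the predicted window's length; the section does not specify either convention.

```python
def overlap_fraction(predicted, actual):
    """Fraction of the predicted interval that overlaps the actual one.

    Both intervals are (start, end) with end exclusive (an assumption).
    """
    pred_start, pred_end = predicted
    act_start, act_end = actual
    shared = min(pred_end, act_end) - max(pred_start, act_start)
    return max(0, shared) / (pred_end - pred_start)


def is_true_positive(predicted, actual_promoters, min_overlap=0.10):
    """True if the prediction overlaps any actual promoter by at least min_overlap."""
    return any(
        overlap_fraction(predicted, actual) >= min_overlap
        for actual in actual_promoters
    )


# A 40-nt prediction sharing 4 nt with an actual promoter meets the 10% threshold.
hit = is_true_positive((0, 40), [(36, 76)])
```

Under such a rule, predictions that fall just beside an actual promoter without overlapping it count as false positives.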
Fig. 4 Predicted promoters observed in actual promoters’ proximity but not overlapping. Blue squares on the first row indicate the location of actual promoters while blue squares on the second and third rows indicate the location of predicted promoters with a predicted probability of 0.6 and 0.5, respectively. Within each circle a predicted promoter cluster is shown
Fig. 5 PR curves (a) and ROC curves (b) per model obtained when counting predicted promoters nearby actual promoters as true positives on M. smegmatis str. MC2 155 bacterium. The numbers between brackets beside the model ID indicate AUPRC (a) and AUROC (b) of that model
Fig. 6 PR curves (a) and ROC curves (b) per model obtained when counting predicted promoters nearby actual promoters as true positives on L. phytofermentans ISDg bacterium. Numbers between brackets beside the model ID indicate AUPRC (a) and AUROC (b) of that model
Fig. 7 PR curves (a) and ROC curves (b) per model obtained when counting predicted promoters nearby actual promoters as true positives on R. capsulatus SB 1003 bacterium. The numbers between brackets beside the model ID indicate AUPRC (a) and AUROC (b) of that model
Fig. 8 PR curves (a) and ROC curves (b) per model obtained when counting predicted promoters nearby actual promoters as true positives on B. amyloliquefaciens XH7 bacterium. The numbers between brackets beside the model ID indicate AUPRC (a) and AUROC (b) of that model
Average AUPRC and AUROC ± standard deviation obtained per model across the validation set on the cluster promoter prediction task
| Model | Mean AUPRC | Mean AUROC |
|---|---|---|
| RF-HOT | ||
| GRU-0 | 0.384 ± 0.218 | 0.898 ± 0.030 |
| GRU-1 | 0.365 ± 0.219 | 0.894 ± 0.027 |
| LSTM-3 | 0.404 ± 0.199 | 0.900 ± 0.011 |
| LSTM-4 | 0.379 ± 0.200 | 0.867 ± 0.033 |
The numbers in bold indicate the highest AUPRC and AUROC
AUPRC per bacterial species and mean AUPRC ± standard deviation for each model
| Model | | | | | Mean AUPRC |
|---|---|---|---|---|---|
| RF-HOT | 0.608 | | | | 0.720 ± 0.161 |
| RF-TETRA | 0.800 | 0.608 | 0.678 | | |
| GRU-0 | 0.646 | 0.486 | 0.486 | 0.588 | 0.552 ± 0.079 |
| GRU-1 | 0.622 | 0.490 | 0.500 | 0.576 | 0.547 ± 0.063 |
| LSTM-3 | 0.625 | 0.499 | 0.494 | 0.559 | 0.544 ± 0.061 |
| LSTM-4 | 0.623 | 0.501 | 0.505 | 0.573 | 0.550 ± 0.059 |
| MULTiPly | 0.649 | 0.474 | 0.653 | 0.591 | 0.592 ± 0.083 |
| iPro70-FMWin | 0.652 | 0.582 | 0.774 | 0.594 | 0.650 ± 0.088 |
| bTSSFinder | (0.512, 0.272) | (0.507, 0.944) | (0, 0) | (0.513, 0.250) | NA |
| G4PromFinder | (0.506, 0.938) | (0.448, 0.216) | (0.382, 0.339) | (0.510, 0.960) | NA |
| BPROM | (0.781, 0.006) | (0.501, 0.560) | (0.701, 0.421) | (0.615, 0.011) | NA |
AUPRC is roughly the weighted average precision across all recall levels. A perfect classifier has an AUPRC of 1, while a random classifier has an AUPRC of 0.5 in a balanced data set. These results were obtained in balanced data sets (i.e., with a 1:1 ratio of positive to negative instances). The numbers in bold indicate the model with the highest AUPRC. For BPROM, bTSSFinder, and G4PromFinder, the numbers between brackets indicate the precision and recall achieved, as these tools do not provide a probability for each instance in the data set
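The description of AUPRC as a weighted average of precision across recall levels can be made concrete with the average-precision estimator, which sums the precision at each rank where a positive instance is retrieved and divides by the number of positives. This is one common way to estimate AUPRC; whether the authors used this estimator or a different interpolation is not stated here, and ties between scores are not treated specially in this sketch.

```python
def average_precision(labels, scores):
    """Average precision: mean of precision values at each positive's rank.

    labels are 0/1, scores are predicted probabilities. Instances are
    ranked by decreasing score; tied scores keep their input order.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive instance is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            total += true_positives / rank  # precision at this recall step
    return total / n_pos


# All positives ranked above all negatives gives the maximum value of 1.
perfect = average_precision([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
```

For a classifier that ranks instances at random, the expected precision at any rank equals the fraction of positives in the data, which is why the random baseline is 0.5 in a 1:1 data set and much lower in a 1:10 data set.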
AUROC per bacterial species in the validation data set and mean AUROC ± standard deviation for each model
| Model | | | | | Mean AUROC |
|---|---|---|---|---|---|
| RF-HOT | 0.591 | 0.640 | 0.660 | | 0.708 ± 0.157 |
| RF-TETRA | 0.814 | | | | |
| GRU-0 | 0.630 | 0.488 | 0.496 | 0.577 | 0.548 ± 0.068 |
| GRU-1 | 0.601 | 0.487 | 0.502 | 0.566 | 0.539 ± 0.054 |
| LSTM-3 | 0.622 | 0.489 | 0.481 | 0.553 | 0.536 ± 0.066 |
| LSTM-4 | 0.592 | 0.498 | 0.506 | 0.546 | 0.536 ± 0.043 |
| MULTiPly | 0.684 | 0.470 | 0.700 | 0.593 | 0.612 ± 0.106 |
| iPro70-FMWin | 0.642 | 0.587 | 0.779 | 0.575 | 0.646 ± 0.093 |
| bTSSFinder | (0.272, 0.265) | (0.944, 0.924) | (0, 0) | (0.250, 0.245) | NA |
| G4PromFinder | (0.938, 0.932) | (0.216, 0.269) | (0.339, 0.554) | (0.960, 0.953) | NA |
| BPROM | (0.006, 0.002) | (0.560, 0.398) | (0.421, 0.181) | (0.011, 0.007) | NA |
AUROC is roughly the likelihood that a positive instance will get a higher probability of being a promoter sequence than a negative instance. These results were obtained in data sets (not seen during training) with a 1:1 ratio of positive to negative instances. The numbers in bold indicate the model with the highest AUROC. For BPROM, bTSSFinder, and G4PromFinder, the numbers between brackets indicate the true-positive rate and false-positive rate obtained, as these tools do not provide a probability for each instance in the data set
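The pairwise reading of AUROC given above translates directly into code: compare every positive instance with every negative instance and count how often the positive one receives the higher score. The sketch below uses that definition, counting ties as half a win (a common convention, assumed here), and also shows the true-positive and false-positive rates reported for tools that output only class labels. Function names are illustrative.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half (assumed convention)
    return wins / (len(positives) * len(negatives))


def tpr_fpr(tp, fn, fp, tn):
    """True-positive rate and false-positive rate from confusion counts."""
    return tp / (tp + fn), fp / (fp + tn)


# Three of the four positive/negative pairs are ordered correctly.
example = auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The quadratic pairwise loop is fine for small examples; for genome-scale data sets a rank-based computation gives the same value more efficiently.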