Lucas Coppens, Laura Wicke, Rob Lavigne.
Abstract
Data availability is a consistent bottleneck for the development of bacterial species-specific promoter prediction software. In this work we leverage genome-wide promoter datasets generated with dRNA-seq in the Gram-negative bacteria Pseudomonas aeruginosa and Salmonella enterica for promoter prediction. Convolutional neural networks are presented as an optimal architecture for model training and are further modified and tailored for promoter prediction. The resulting predictors reach high binary accuracies (95% and 94.9%) on test sets, and each outperforms the other when predicting promoters in its associated species. SAPPHIRE.CNN is available online and can also be downloaded to run locally. Our results indicate a dependency of binary promoter classification on an organism's GC content and a decreased performance of our classifiers on genera they were not trained for, further supporting the need for dedicated, species-specific promoter classification tools.
Year: 2022 PMID: 36147675 PMCID: PMC9478156 DOI: 10.1016/j.csbj.2022.09.006
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
Fig. 1. a) Per-base average sequence content of regions upstream of TSSs and σ70 promoter motifs found in these regions. b) σ70 motif scores and clustering for all TSSs obtained by dRNA-seq for P. aeruginosa. c) σ70 motif scores and clustering for all TSSs obtained by dRNA-seq for S. enterica.
Fig. 2. a) Lowest validation loss reached during training, with the corresponding sensitivity and specificity, for five different types of neural networks on the P. aeruginosa dataset. b) Lowest validation loss reached during training, with the corresponding sensitivity and specificity, for five different types of neural networks on the S. enterica dataset. c) Average sensitivity and specificity over multiple training iterations of CNNs for both species, on promoter sequences with various numbers of base pairs included upstream of the TSSs.
Performance of SAPPHIRE.CNN.pseudomonas on the different test sets.
| Sensitivity (%) | 98.6 | 78.2 |
| Specificity (%) | 85.7 | 88.3 |
| Binary accuracy (%) | 92.2 | 83.3 |
Performance of SAPPHIRE.CNN.salmonella on the different test sets.
| Sensitivity (%) | 81.5 | 59.0 |
| Specificity (%) | 99.3 | 95.6 |
| Binary accuracy (%) | 90.4 | 77.3 |
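The three metrics reported in the tables above can be computed from true and predicted labels as in the following minimal sketch. The function name `binary_metrics` and the 0/1 label encoding (1 = promoter, 0 = non-promoter) are assumptions made for illustration and are not taken from SAPPHIRE.CNN. Every binary accuracy in the tables equals, to rounding, the mean of the sensitivity and specificity in the same column. That is what one would expect for test sets with equal numbers of promoter and non-promoter sequences, although this is an inference from the numbers rather than something stated here.

```python
def binary_metrics(y_true, y_pred):
    """Return sensitivity, specificity and binary accuracy for 0/1 labels.

    Assumed encoding (illustrative): 1 = promoter, 0 = non-promoter.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # promoters found
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # promoters missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-promoters rejected
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false promoter calls

    nan = float("nan")
    return {
        # Fraction of true promoters that are classified as promoters.
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        # Fraction of non-promoters that are classified as non-promoters.
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
        # Fraction of all sequences that are classified correctly.
        "binary_accuracy": (tp + tn) / len(pairs) if pairs else nan,
    }


if __name__ == "__main__":
    # Toy, balanced example: four promoters followed by four non-promoters.
    truth = [1, 1, 1, 1, 0, 0, 0, 0]
    calls = [1, 1, 1, 0, 0, 0, 1, 1]
    print(binary_metrics(truth, calls))
```

On a balanced set such as the toy example, the binary accuracy is exactly the average of sensitivity and specificity; on an unbalanced set the two quantities diverge.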
Fig. 3. Promoter classification dependency on GC content. Each dot shows how many sequences, out of a group of 100 randomly generated sequences with a given GC content, are classified as promoters by the respective predictor. Full black line: average GC content of the P. aeruginosa genome (PAO1, accession: NC_002516). Dashed black line: average GC content of the P. aeruginosa promoter sequences used to train SAPPHIRE.CNN.pseudomonas. Full red line: average GC content of the S. enterica genome (subsp. enterica serovar Typhimurium str. ST4/74, accession: CP002487). Dashed red line: average GC content of the S. enterica promoter sequences used to train SAPPHIRE.CNN.salmonella. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
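The experiment summarised in the Fig. 3 caption (groups of 100 random sequences per GC content, each scored by a predictor) could be sketched roughly as follows. This is an illustration under stated assumptions and not the authors' code: the sequence length, the sampling scheme (independent bases, with G and C sharing the GC probability equally), and all function names are hypothetical, and `toy_classifier` is a stand-in that merely calls AT-rich sequences promoters. A real run would pass a trained predictor in its place.

```python
import random


def random_sequence(length, gc_content, rng):
    """Draw a random DNA sequence whose expected GC fraction is gc_content."""
    if not 0.0 <= gc_content <= 1.0:
        raise ValueError("gc_content must lie in [0, 1]")
    at = (1.0 - gc_content) / 2.0
    gc = gc_content / 2.0
    # Weights are listed in the order A, C, G, T.
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))


def gc_fraction(seq):
    """Observed fraction of G and C bases in a sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def toy_classifier(seq):
    """Placeholder predictor (NOT SAPPHIRE.CNN): AT-rich means promoter."""
    return gc_fraction(seq) < 0.5


def promoter_calls_per_gc(classifier, gc_values, group_size=100,
                          length=50, seed=0):
    """Count promoter calls in one group of random sequences per GC value.

    group_size=100 follows the figure caption; length=50 is an arbitrary
    illustrative choice, not a value taken from the paper.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    counts = {}
    for gc in gc_values:
        group = (random_sequence(length, gc, rng) for _ in range(group_size))
        counts[gc] = sum(1 for seq in group if classifier(seq))
    return counts


if __name__ == "__main__":
    gc_grid = [i / 10 for i in range(11)]  # GC content from 0.0 to 1.0
    for gc, n in promoter_calls_per_gc(toy_classifier, gc_grid).items():
        print(f"GC={gc:.1f}: {n} of 100 classified as promoter")
```

Plotting the returned counts against GC content would give one dot per group, in the manner the caption describes.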
Number of promoters identified by various promoter classifiers in promoter sequences retrieved from NCBI Nucleotide for various Gram-negative genera. For each genus/species, the best performing classifier is highlighted in green.