Theodor Sperlea, Lea Muth, Roman Martin, Christoph Weigel, Torsten Waldminghaus, Dominik Heider.
Abstract
The biology of bacterial cells is, in general, based on information encoded on circular chromosomes. Regulation of chromosome replication is an essential process that mostly takes place at the origin of replication (oriC), a locus unique per chromosome. Identification of high numbers of oriC is a prerequisite for systematic studies that could lead to insights into oriC functioning as well as the identification of novel drug targets for antibiotic development. Current methods for identifying oriC sequences rely on chromosome-wide nucleotide disparities and are therefore limited to fully sequenced genomes, leaving a large number of genomic fragments unstudied. Here, we present gammaBOriS (Gammaproteobacterial oriC Searcher), which identifies oriC sequences on gammaproteobacterial chromosomal fragments. It does so by employing motif-based machine learning methods. Using gammaBOriS, we created BOriS DB, which currently contains 25,827 gammaproteobacterial oriC sequences from 1,217 species, thus making it the largest available database for oriC sequences to date. Furthermore, we present gammaBOriTax, a machine-learning-based approach for taxonomic classification of oriC sequences, which was trained on the sequences in BOriS DB. Finally, we extracted the motifs relevant for identification and classification decisions of the models. Our results suggest that machine learning sequence classification approaches can offer great support in functional motif identification.
Year: 2020 PMID: 32317695 PMCID: PMC7174414 DOI: 10.1038/s41598-020-63424-7
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Schematic representation of the structure of gammaBOriS. Evaluation metrics on the right side of the diagram represent performance on the validation dataset.
Figure 2. Taxonomic distribution and gammaBOriS prediction results for oriC sequences present in the training and test dataset (see main text for more details). Color coding, from outer to inner ring, separated by semicolons, with the number of organisms in each category in parentheses: dark blue signifies chromosomes that contain two oriC sequences (105), light blue those that contain a single oriC sequence (355); lime green indicates that two sequences (13) were correctly identified, pale green that one (392) was correctly identified, spring green that a sequence was identified that overlaps with the correct sequence (48), and white that the identification was incorrect (7); red, yellow, and white indicate that two (2), one (4), and zero (454) sequences were misidentified as oriC sequences (false positives), respectively; darker color indicates a higher number of candidate fragments that fall between the two cutoffs, for which gammaBOriS abstained from classification (min.: 1, max.: 2463, mean: 129.38).
Figure 3. Evaluation of different models for taxonomic classification of oriC sequences present in BOriS DB. MacroAUPR denotes the average, over taxa, of the area under the precision-recall curve (AUPR), a common metric for imbalanced multi-class classification tasks. RF stands for Random Forest. Error bars represent the standard deviation of the AUPR values achieved for the different taxa.
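The macro-averaged AUPR described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes a one-vs-rest setup and the step-wise average-precision estimator of the area under the precision-recall curve; the exact estimator used for the figure is not stated here, and the function names `average_precision` and `macro_aupr` are hypothetical.

```python
from typing import Dict, Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve for one class.

    labels: 1 for members of the class, 0 otherwise.
    scores: classifier scores, higher meaning "more likely in the class".
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive example is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    area = 0.0
    true_pos = 0
    prev_recall = 0.0
    i = 0
    while i < len(ranked):
        # Treat all examples tied at the same score as a single threshold.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            true_pos += ranked[j][1]
            j += 1
        recall = true_pos / n_pos
        precision = true_pos / j  # j examples are predicted positive so far
        area += (recall - prev_recall) * precision
        prev_recall = recall
        i = j
    return area


def macro_aupr(y_true: Sequence[str], y_score: Dict[str, Sequence[float]]) -> float:
    """Unweighted mean of the one-vs-rest AUPR values over all taxa."""
    per_taxon = []
    for taxon, scores in y_score.items():
        labels = [1 if t == taxon else 0 for t in y_true]
        per_taxon.append(average_precision(labels, scores))
    return sum(per_taxon) / len(per_taxon)
```

Because every taxon contributes equally to the mean regardless of how many sequences it has, rare taxa weigh as much as common ones, which is why the macro average suits imbalanced multi-class tasks.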
Consensus sequences of important motifs extracted from LS-GKM models trained for taxonomic classification.
| Taxonomic level | Taxon | Taxonomically relevant motif | Reverse-complementary motif | Annotation |
|---|---|---|---|---|
| Order | Alteromonadales | AATAACAGTAATA | | DnaA trio |
| | | AGATCTTAAGATCT | | DUE |
| | Pseudomonadales | TTC | | DnaA box (R1) |
| | | plus: AT-rich, ungroupable | | DUE |
| | Vibrionales | AA | T | RctB binding site |
| | Xanthomonadales | GTGGT | ACCA | DnaA box |
| Family | Enterobacteriaceae | TAAGAGATCA | TGATCTCTTA | DnaA trio |
| | Halomonadaceae | ACAGAACTTC | GAAGTTCTGT | DUE |
| | Pseudomonadaceae | TATA | TAWAAAGCTTATA | DnaA trio |
| | | TC | | DnaA box (R1, R4 or p7/8) |
| | Vibrionaceae | AA | T | RctB binding site, DnaA box (R1) |
| | Xanthomonadaceae | GTGGT | | DnaA box (R1) |
| Genus | | | | DnaA box (R2, R4) |
| | Acinetobacter | TAAATTTAAATTTA | | DnaA trio |
| | Escherichia | CAAGGATCCAGCTTTTAA | | IHF binding site |
| | Pseudomonas | TC | TTC | DnaA box (R1) |
| | Shigella | CAAGGATCCGATTTTTAA | | IHF binding site |
| | | CGCACTACCCTGTGGA | TCCACAGGGTAGTGCG | DnaA trio |
Reverse-complementary motif pairs are displayed in the same row. Annotation of the motifs is based on substrings highlighted in bold, except for motifs annotated with DUE, which were annotated based on their location in the downstream unwinding element of oriC sequences from the respective taxon. Because LS-GKM models ambiguous bases, the consensus sequences presented here do not necessarily occur in oriC sequences in this exact form. The motifs annotated as RctB binding site conform to the consensus sequence NNNNNNWTGATCATKSWT. DnaA box annotation appendices were adapted from Grimwade et al.[71].
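The two notions used in this note, reverse-complementary motif pairs and conformance to an IUPAC consensus such as NNNNNNWTGATCATKSWT, can be sketched in a few lines. This is an illustrative sketch only, not the authors' annotation procedure; the helper names `reverse_complement`, `iupac_to_regex`, and `conforms` are hypothetical, and the test sequence in the usage note is made up.

```python
import re

# Complement table covering the standard IUPAC nucleotide codes.
_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

# Bases allowed by each IUPAC code.
_IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

RCTB_CONSENSUS = "NNNNNNWTGATCATKSWT"  # consensus quoted in the table note


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a (possibly ambiguous) DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def iupac_to_regex(consensus: str) -> "re.Pattern":
    """Compile an IUPAC consensus into a regex with one character class per base."""
    return re.compile("".join(f"[{_IUPAC[base]}]" for base in consensus.upper()))


def conforms(seq: str, consensus: str) -> bool:
    """True if seq has the consensus length and matches it at every position."""
    return iupac_to_regex(consensus).fullmatch(seq.upper()) is not None
```

For example, `reverse_complement("TAAGAGATCA")` gives `"TGATCTCTTA"`, the pair listed for Enterobacteriaceae, and `conforms("ACGTACATGATCATGCAT", RCTB_CONSENSUS)` is true for this made-up sequence, while changing its fifteenth base from G to A violates the K (G or T) position.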