Advait Balaji, Bryce Kille, Anthony D Kappell, Gene D Godbold, Madeline Diep, R A Leo Elworth, Zhiqin Qian, Dreycey Albin, Daniel J Nasko, Nidhi Shah, Mihai Pop, Santiago Segarra, Krista L Ternus, Todd J Treangen.
Abstract
The COVID-19 pandemic has emphasized the importance of accurate detection of known and emerging pathogens. However, robust characterization of pathogenic sequences remains an open challenge. To address this need, we developed SeqScreen, which accurately characterizes short nucleotide sequences using taxonomic and functional labels and a customized set of curated Functions of Sequences of Concern (FunSoCs) specific to microbial pathogenesis. We show that our ensemble machine learning model can label protein-coding sequences with FunSoCs with high recall and precision. SeqScreen is a step towards a novel paradigm of functionally informed synthetic DNA screening and pathogen characterization, available for download at www.gitlab.com/treangenlab/seqscreen.
Year: 2022 PMID: 35725628 PMCID: PMC9208262 DOI: 10.1186/s13059-022-02695-x
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 17.906
Fig. 1 Comparison of VFDB to SeqScreen Biocurator Database. (A) Venn diagram showing the number of GO terms captured by VFDB Core sequences, the SeqScreen training dataset labeled by biocurators, and their overlap. (B) Box plot comparing the annotation scores (1-5) of the associated UniProt/UniParc IDs between VFDB Core sequences and SeqScreen training data. The p-value was calculated using the Mann-Whitney U test
Fig. 2 SeqScreen overview. (A) SeqScreen workflow: this panel outlines the modules and workflows of the SeqScreen pipeline. Modules in green boxes are run only in the sensitive mode, modules in yellow boxes are run only in the fast mode, and modules in blue boxes are common to both modes. In addition to the two modes, SeqScreen contains optional modules that are run depending on the parameters provided by the user. (B) SeqScreen human-in-the-loop framework: training data is first annotated and curated manually and is then used to train the ensemble ML models. The results and the selected feature weights are passed back to the biocurators, who fine-tune the features and UniProt queries to form a new, refined set of training data for the ensemble model
Fig. 3 HTML report output from SeqScreen. This is a screenshot of the interactive HTML page, which lists each query sequence in the file along with its length, its gene name (if found), and the GO terms associated with it. The report also indicates the presence (or absence) of each of the 32 FunSoCs with a 1 (or 0) in the corresponding field
Fig. 4 Majority Voting Ensemble Classifier used to create FunSoC Database. The three models combined are Bl. SVC + NN (OS), a balanced linear support vector classifier + neural network (oversampled); TS NN, a two-stage neural network; and TS Bl. SVC, a two-stage balanced linear support vector classifier. The binary predictions of the three classifiers for each FunSoC are combined in a majority voting scheme to predict the final labels for the SeqScreen FunSoC database, which is then used to annotate query sequences. Training data is split into train (56.75%), validation (18.25%), and test (25%) sets. The two-stage methods first detect the presence of at least one FunSoC and then carry out the multi-class multi-label predictions. Dropout (neural networks) and L1 regularization (support vector classifier) are used to control overfitting. Two of the models address class imbalance in the training data: Bl. SVC + NN (OS) uses random oversampling after feature selection, and TS Bl. SVC uses class weights
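The majority voting scheme described in the Fig. 4 caption can be illustrated with a minimal Python sketch. This is not the SeqScreen implementation: the function name `majority_vote`, the nested-list layout, and the toy predictions are assumptions made for illustration, and only three labels are shown instead of the 32 FunSoCs.

```python
from typing import List


def majority_vote(predictions: List[List[List[int]]]) -> List[List[int]]:
    """Combine binary multi-label predictions from several classifiers.

    predictions[m][i][k] is 1 if model m assigns label k to sequence i.
    A label is kept only when more than half of the models predict it.
    (Hypothetical helper; the data layout is an assumption.)
    """
    n_models = len(predictions)
    n_samples = len(predictions[0])
    n_labels = len(predictions[0][0])
    combined = []
    for i in range(n_samples):
        row = []
        for k in range(n_labels):
            votes = sum(predictions[m][i][k] for m in range(n_models))
            row.append(1 if 2 * votes > n_models else 0)
        combined.append(row)
    return combined


# Toy predictions for two sequences and three labels (made-up values).
bl_svc_nn = [[1, 0, 1], [0, 0, 0]]
ts_nn = [[1, 0, 0], [0, 1, 0]]
ts_bl_svc = [[0, 1, 1], [0, 1, 0]]

ensemble = majority_vote([bl_svc_nn, ts_nn, ts_bl_svc])
print(ensemble)  # [[1, 0, 1], [0, 1, 0]]
```

With three voters a label needs at least two votes, so a single high-recall or high-precision model cannot decide a label on its own.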
The accuracy, exact match ratio, micro and macro F1 score, macro recall, and precision of the different ML models. The models we considered were balanced SVC (feature selection) + neural network classification using oversampling (Bl. SVC + NN (OS)), two-stage detection + classification neural networks (TS NN), two-stage detection + classification balanced support vector classifier (TS Bl. SVC), and the majority vote ensemble classifier (MV ensemble). TS NN had the highest positive label (PL) precision and TS Bl.SVC had the highest positive label (PL) recall, while Bl. SVC + NN (OS) had the best balance between precision and recall. Majority vote ensemble improved on the results of the three classifiers as conveyed by both the high precision and recall the method achieves
| Model | Accuracy | Exact match ratio | Micro F1 score | Macro F1 score | Macro recall | Macro precision | Mean PL precision | Mean PL recall |
|---|---|---|---|---|---|---|---|---|
| Bl. SVC + NN (OS) | 0.9997 | 0.9924 | 0.9859 | 0.8210 | 0.8039 | 0.8716 | 0.8759 | 0.8180 |
| TS NN | 0.9997 | 0.9924 | 0.9359 | 0.6934 | 0.6445 | 0.8011 | 0.8893 | 0.6988 |
| TS Bl.SVC | 0.9996 | 0.9893 | 0.8692 | 0.7047 | 0.8310 | 0.6492 | 0.7382 | 0.8869 |
| MV ensemble | 0.9997 | 0.9934 | 0.9424 | 0.7998 | 0.8016 | 0.8453 | 0.9003 | 0.8273 |
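The gap between micro and macro F1 in the table above (for example 0.9424 versus 0.7998 for the MV ensemble) follows from how the two averages are formed. The sketch below uses the standard textbook definitions of exact match ratio, micro F1, and macro F1 for multi-label data; whether the paper computed its metrics in exactly this way, and how it handled labels with no positives, is an assumption. The function names and toy labels are hypothetical.

```python
from typing import Dict, List


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when undefined (an assumption)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_metrics(y_true: List[List[int]],
                       y_pred: List[List[int]]) -> Dict[str, float]:
    n_labels = len(y_true[0])
    # Exact match ratio: fraction of samples whose full label vector is correct.
    exact = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    per_label = []
    tot_tp = tot_fp = tot_fn = 0
    for k in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 0 and p[k] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 0)
        per_label.append(f1(tp, fp, fn))
        tot_tp, tot_fp, tot_fn = tot_tp + tp, tot_fp + fp, tot_fn + fn
    return {
        "exact_match": exact,
        "micro_f1": f1(tot_tp, tot_fp, tot_fn),  # pools counts over labels
        "macro_f1": sum(per_label) / n_labels,   # averages per-label scores
    }


# Toy data: four sequences, two labels (made-up values).
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_pred = [[1, 0], [0, 0], [1, 1], [0, 0]]
metrics = multilabel_metrics(y_true, y_pred)
print(metrics)
```

Micro F1 pools counts across labels and is dominated by frequent labels, whereas macro F1 weights every label equally, so poorly predicted rare labels pull it down.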
Fig. 5 Positive label precision and recall per FunSoC for the four ML models: Bl. SVC + NN (OS) (in blue), TS NN (in green), TS Bl. SVC (in yellow), and MV ensemble (in brown). Precision is shown in solid lines and recall in dotted lines. TS Bl. SVC shows the best overall recall, whereas TS NN has the highest precision across most of the 32 FunSoCs. In hard-to-classify FunSoCs such as nonviral invasion and bacterial counter signaling, TS NN performs poorly, indicating a model with a high degree of variance. Similarly, TS Bl. SVC suffers from poor precision in most cases. The majority vote classifier improves on Bl. SVC + NN (OS) and finds an optimal balance between precision and recall across all FunSoCs
Fig. 6 Pathogen identification of hard-to-classify pathogens: FunSoCs assigned to genes by SeqScreen. Abbreviated gene names are listed in pink cells if at least one read from the gene had a UniProt e-value < 0.0001, was assigned a FunSoC, and was from the expected genus (i.e., Escherichia or Shigella, Clostridium, Streptococcus, Lactobacillus). FunSoCs with at least one gene that met the criteria for detection in at least one isolate were included in the table. Excluding genes from genera that were not expected in these bacterial isolates removed genes that were likely derived from contaminating organisms (e.g., PhiX Illumina sequencing control). An expanded table for cells denoted by (*) and the complete gene names listed within each cell are given in Table S3. (a and b) E. coli O157:H7 is shown to have the shiga toxin (stxB), as seen in the cytotoxicity FunSoC, as well as an additional hit to the secreted effector protein (espF(U)), labeled with the secreted effector and virulence regulator FunSoCs, compared to E. coli K12 MG1655. (c and d) C. botulinum showed four distinct FunSoCs (disable organ, cytotoxicity, degrade ecm, and virulence regulator) and the presence of the botA and orf-X2 genes compared to C. sporogenes. (e and f) S. pyogenes showed the induce inflammation FunSoC, in contrast to the near neighbor pathogen S. dysgalactiae, which showed the counter immunoglobulin FunSoC. (g and h) S. salivarius and L. gasseri are well-known commensals that are generally considered harmless. Both show the presence of antibiotic resistance genes, while S. salivarius also contains some genes associated with secretion. The commensals have hits to the fewest FunSoCs
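The three detection criteria in the Fig. 6 caption (e-value below 0.0001, at least one FunSoC assigned, expected genus) amount to a simple filter over per-read hits. The sketch below is illustrative only: the `ReadHit` record, its field names, and the example hits are hypothetical and do not reflect SeqScreen's actual output format.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass
class ReadHit:
    """Hypothetical per-read annotation record (field names are assumptions)."""
    gene: str
    genus: str
    evalue: float
    funsocs: List[str]


def detected_genes(hits: Iterable[ReadHit],
                   expected_genera: Set[str],
                   max_evalue: float = 1e-4) -> Dict[str, Set[str]]:
    """Map each FunSoC to the genes with at least one read passing all criteria."""
    result: Dict[str, Set[str]] = {}
    for hit in hits:
        passes = (hit.evalue < max_evalue
                  and len(hit.funsocs) > 0
                  and hit.genus in expected_genera)
        if passes:
            for funsoc in hit.funsocs:
                result.setdefault(funsoc, set()).add(hit.gene)
    return result


# Made-up hits for an E. coli isolate; values are for illustration only.
hits = [
    ReadHit("stxB", "Escherichia", 1e-30, ["cytotoxicity"]),
    ReadHit("espF(U)", "Escherichia", 1e-10,
            ["secreted effector", "virulence regulator"]),
    ReadHit("geneX", "UnexpectedGenus", 1e-40, ["cytotoxicity"]),  # wrong genus
    ReadHit("geneY", "Escherichia", 0.01, ["cytotoxicity"]),       # weak e-value
    ReadHit("geneZ", "Shigella", 1e-20, []),                       # no FunSoC
]

table = detected_genes(hits, expected_genera={"Escherichia", "Shigella"})
print(table)
```

Applying the genus check last in the conjunction has no effect on the result; all three conditions must hold for a gene to appear in a cell.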
Pathogen and near neighbor classification. SRA represents the SRA ID of the sample, True Organism represents the actual bacterial strain or species, and the remaining columns indicate the results for the indicated method using the parameters detailed in the “Methods” section. Green cells indicate that the tool assigned a correct strain-level call, yellow indicates a correct species-level call, and red indicates an incorrect species-level call. The following tools and databases were run: Mash dist (RefSeq 10 k), Sourmash (RefSeq + GenBank), PathoScope (PathoScope DB), Kraken 2 (Mini and full Kraken2 DB produced the same results), KrakenUniq (MiniKraken 8GB), MetaPhlAn3 (default), and Kaiju (index of NCBI nr + euk). The E. coli strains were challenging for most tools. The pathogenic E. coli O157:H7 was correctly called by Mash dist, Sourmash, PathoScope, Kraken2, and KrakenUniq, whereas MetaPhlAn3 and Kaiju could only make a species-level assignment. The commensal E. coli K12 MG1655 was the most challenging, as only Mash dist and Sourmash got the strain-level assignment correct. MetaPhlAn3 and Kaiju could make only species-level assignments, and PathoScope, Kraken2, and KrakenUniq called it as strains E. coli BW2952, E. coli O157:H7, and E. coli O145:H28, respectively. Even with a complete database, C. sporogenes was wrongly classified as C. botulinum by PathoScope, Kraken2, and KrakenUniq. Mash dist, Sourmash, and Kaiju predicted C. sporogenes correctly, while MetaPhlAn3 was ambiguous. C. botulinum was incorrectly classified as C. sporogenes by Mash dist and Sourmash, and S. dysgalactiae was predicted as S. pyogenes by PathoScope. All tools correctly called S. pyogenes
Simulating a novel pathogen. Mash dist and PathoScope were run on pathogen sequences and their near neighbors with the corresponding truth species removed from their respective databases, to simulate classifying a novel pathogen that is not in the database. SRA represents the SRA ID of the sample, True Organism represents the actual bacterial strain or species, Mash dist represents the Mash results on each of the samples (with the truth organism species or strain removed from its sketch database), and PathoScope represents the PathoScope results on each of the samples (with the truth organism species or strain removed from its database). In three of the cases, C. sporogenes, C. botulinum, and S. pyogenes, Mash dist classified the organism as its near neighbor: C. botulinum, C. sporogenes, and S. dysgalactiae, respectively. S. dysgalactiae was classified as S. sp. NCTC 11567, whereas the commensal E. coli K12 and pathogenic E. coli O157:H7 were classified as E. coli O16:H48 and E. coli 2009C-3554, respectively. PathoScope classified only two pathogens, C. sporogenes and C. botulinum, as their nearest neighbor counterparts. S. dysgalactiae was classified as S. intermedius, whereas S. pyogenes was classified as S. infantarius. E. coli K12 was classified only at the species level, while the pathogenic strain E. coli O157:H7 was classified as E. coli xuzhou21
| SRA | True Organism | Mash dist | PathoScope |
|---|---|---|---|
| DRR198806 | | | |
| DRR198804 | | | |
| SRR8758382 | | | |
| SRR8981313 | | | |
| SRR12825903 | | | |
| ERR1735064 | | | |