T Jeffrey Cole, Michael S Brewer.
Abstract
In the era of Next-Generation Sequencing and shotgun proteomics, the sequences of animal toxigenic proteins are being generated at rates exceeding the pace of traditional means for empirical toxicity verification. To facilitate the automation of toxin identification from protein sequences, we trained Recurrent Neural Networks with Gated Recurrent Units on publicly available datasets. The resulting models are available via the novel software package TOXIFY, allowing users to infer the probability of a given protein sequence being a venom protein. TOXIFY is more than 20× faster and uses over an order of magnitude less memory than previously published methods. Additionally, TOXIFY is more accurate, precise, and sensitive at classifying venom proteins.
Keywords: Deep learning; Protein classification; Proteome; Transcriptome; Venom
Year: 2019 PMID: 31293833 PMCID: PMC6601600 DOI: 10.7717/peerj.7200
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1: Workflow diagram for TOXIFY, including preprocessing of training data, filtering by size and zero padding, converting sequences to numeric vectors of Atchley factors, and training a neural network with gated recurrent units.
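The size filter, zero padding, and Atchley-factor conversion named in the Figure 1 caption can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the TOXIFY implementation: the function name `encode_sequence`, the `max_len` cutoff, padding at the end of the sequence, and dropping sequences with unknown residues are all hypothetical choices. The `TOY_FACTORS` table holds placeholder numbers for three residues only and is not the published Atchley factor table; the real 20-residue, 5-factor table would be substituted in practice.

```python
from typing import Dict, List, Optional, Sequence

# Placeholder 5-dimensional vectors for three residues only.
# These are NOT the published Atchley factor values (assumption for illustration).
TOY_FACTORS: Dict[str, Sequence[float]] = {
    "A": (0.1, 0.2, 0.3, 0.4, 0.5),
    "C": (-0.1, -0.2, -0.3, -0.4, -0.5),
    "G": (1.0, 0.0, -1.0, 0.0, 1.0),
}


def encode_sequence(
    seq: str,
    factors: Dict[str, Sequence[float]],
    max_len: int,
) -> Optional[List[List[float]]]:
    """Return a max_len x n_factors matrix, or None if the sequence is filtered out."""
    seq = seq.strip().upper()
    # Size filter: drop empty sequences and those longer than the cutoff.
    if not seq or len(seq) > max_len:
        return None
    # Dropping sequences with residues missing from the table is an assumption.
    if any(residue not in factors for residue in seq):
        return None
    n_factors = len(next(iter(factors.values())))
    # One row of numeric factors per residue.
    rows = [list(factors[residue]) for residue in seq]
    # Zero padding (appended at the end here) so every sequence has the same shape.
    rows.extend([0.0] * n_factors for _ in range(max_len - len(seq)))
    return rows


matrix = encode_sequence("ACG", TOY_FACTORS, max_len=5)
```

Fixed-shape matrices of this kind are what a recurrent network would consume in batches; the padded rows carry no residue information.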
Figure 2: Accuracy progression for the RNN as training progressed, for both training and test datasets.
Benchmark metrics for TOXIFY compared to ClanTox and ToxClassifier.
The top portion shows averages and percentages (in parentheses) of true and false positives (TP & FP) as well as true and false negatives (TN & FN). Additionally, the proportions for accuracy, specificity, sensitivity, balanced accuracy, negative predictive value, positive predictive value, F-score, and Matthews correlation coefficient are listed. The bottom portion shows computational performance in terms of CPU time in seconds and memory usage in megabytes. Asterisks indicate metrics in which TOXIFY outperformed the other methods.
| Metric | ClanTox | ToxClassifier | TOXIFY |
|---|---|---|---|
| TP* | 147.8 (54.2%) | 152.8 (55.8%) | 209.6 (76.5%) |
| TN | 223.4 (81.8%) | 270.2 (98.6%) | 263.0 (96.0%) |
| FP | 50.6 (18.5%) | 3.8 (1.4%) | 11.0 (4.0%) |
| FN* | 126.2 (46.2%) | 121.2 (44.2%) | 64.4 (23.5%) |
| ACC* | 0.68 | 0.77 | 0.86 |
| SPEC | 0.82 | 0.99 | 0.96 |
| SENS* | 0.54 | 0.56 | 0.76 |
| BACC* | 0.68 | 0.77 | 0.86 |
| NPV* | 0.64 | 0.69 | 0.80 |
| PPV | 0.74 | 0.98 | 0.95 |
| F1* | 0.63 | 0.71 | 0.85 |
| MCC* | 0.37 | 0.60 | 0.74 |
| CPU (s)* | NA | 100.18 | 4.05 |
| MEM (MB)* | NA | 6,824 | 293 |
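The derived rows of the table (ACC through MCC) follow from the four confusion-matrix counts. The sketch below uses the standard textbook definitions of these metrics; the source does not state the formulas it used, so this is an assumption, and the function name `classification_metrics` is hypothetical. It also assumes every denominator is non-zero, except for MCC, where a zero denominator is mapped to 0.0 by convention.

```python
import math
from typing import Dict


def classification_metrics(tp: float, tn: float, fp: float, fn: float) -> Dict[str, float]:
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    sens = tp / (tp + fn)            # sensitivity (recall, true positive rate)
    spec = tn / (tn + fp)            # specificity (true negative rate)
    ppv = tp / (tp + fp)             # positive predictive value (precision)
    npv = tn / (tn + fn)             # negative predictive value
    acc = (tp + tn) / (tp + tn + fp + fn)
    bacc = (sens + spec) / 2         # balanced accuracy
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {
        "ACC": acc, "SPEC": spec, "SENS": sens, "BACC": bacc,
        "NPV": npv, "PPV": ppv, "F1": f1, "MCC": mcc,
    }


# Averaged counts from the right-most column of the table above.
metrics = classification_metrics(tp=209.6, tn=263.0, fp=11.0, fn=64.4)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With the averaged counts from the right-most column, these definitions agree with the tabulated ACC, SPEC, SENS, BACC, NPV, PPV, F1, and MCC values to within rounding.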