Qing Li, Maren Watkins, Samuel D Robinson, Helena Safavi-Hemami, Mark Yandell.
Abstract
Cone snails (genus Conus) are venomous marine snails that inject prey with a lethal cocktail of conotoxins: small, secreted, cysteine-rich peptides. Given their diversity and often high affinity for their molecular targets (ion channels, receptors, or transporters), many conotoxins have become invaluable pharmacological probes, drug leads, and therapeutics. Transcriptome sequencing of Conus venom glands, followed by de novo assembly and homology-based toxin identification and annotation, is currently the state of the art for discovering new conotoxins. However, homology-based search techniques, by definition, can only detect novel toxins that are homologous to previously reported conotoxins. To overcome this obstacle to discovery, we have created ConusPipe, a machine learning tool that uses prominent chemical characteristics of conotoxins to predict whether a transcript in a Conus transcriptome that has no otherwise detectable homologs in current reference databases is a putative conotoxin. By using ConusPipe on RNASeq data of 10 species, we report 5148 new putative conotoxin transcripts that have no homologues in current reference databases; 896 of these were identified by at least three out of four models used. These data significantly expand current publicly available conotoxin datasets, and our approach provides a new computational avenue for the discovery of novel toxin families.
Keywords: cone snails; conotoxins; drug discovery; machine learning; venom
Year: 2018 PMID: 30513724 PMCID: PMC6315676 DOI: 10.3390/toxins10120503
Source DB: PubMed Journal: Toxins (Basel) ISSN: 2072-6651 Impact factor: 4.546
Figure 1. (A) Overview of feature selection for the machine learning models and (B) the cross-species Blastp methodology used in addition to the machine learning models.
Figure 2. Venn diagram illustrating the different combinations of methodologies used (single method, overlap, or union of methods) and the total number of putative toxin candidates identified by each method (the number of unique toxin candidates is shown in parentheses). A union of methods means that a conotoxin is predicted by one or more of the methods; for example, Union.4methods = predicted by perceptron or logit or label spreading or blast. An overlap of methods means that the conotoxin is predicted by all the applied methods; for example, Overlap.4methods = predicted by perceptron and logit and label spreading and blast. Abbreviations used: B, blast; L, logit; S, label spreading; P, perceptron.
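The union and overlap definitions in the Figure 2 caption map directly onto set operations. The following is a minimal sketch with made-up transcript IDs (`t1` to `t6` are hypothetical and not taken from the study); the `at_least` helper illustrates the "at least three out of four" criterion mentioned in the abstract, not the authors' actual implementation.

```python
from collections import Counter

# Hypothetical predictions: method abbreviation -> set of transcript IDs it flags.
predictions = {
    "B": {"t1", "t2", "t3"},        # blast
    "L": {"t2", "t3", "t4"},        # logit
    "S": {"t3", "t5"},              # label spreading
    "P": {"t2", "t3", "t4", "t6"},  # perceptron
}

def union(preds, methods):
    """Transcripts predicted by one or more of the given methods."""
    return set().union(*(preds[m] for m in methods))

def overlap(preds, methods):
    """Transcripts predicted by all of the given methods."""
    return set.intersection(*(preds[m] for m in methods))

def at_least(preds, k):
    """Transcripts predicted by at least k of the methods in preds."""
    votes = Counter(t for flagged in preds.values() for t in flagged)
    return {t for t, n in votes.items() if n >= k}

union_4 = union(predictions, "LSPB")      # analogous to Union.4methods
overlap_4 = overlap(predictions, "LSPB")  # analogous to Overlap.4methods
overlap_lp = overlap(predictions, "LP")   # analogous to Overlap.LP
three_of_four = at_least(predictions, 3)
```

By construction, an overlap can never be larger than any of its member sets, and a union can never be smaller.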
Maximized sensitivity, specificity, and accuracy for chosen regularization parameter settings for three machine learning models in 10-fold cross validation.
| Machine Learning Model | Sensitivity | Specificity | Accuracy |
|---|---|---|---|
| Logit | 82.85% | 99.30% | 97.78% |
| Label spreading | 90.93% | 99.07% | 98.32% |
| Perceptron | 83.24% | 97.65% | 96.32% |
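For reference, the three performance measures in this table follow the standard confusion-matrix definitions. The sketch below uses invented counts (not the study's data). It also shows that accuracy is a prevalence-weighted average of sensitivity and specificity, which is why accuracy sits close to specificity when negatives dominate. The `implied_positive_fraction` helper is an illustrative assumption that all three measures were computed on the same pooled set.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)               # true positive rate
    specificity = tn / (tn + fp)               # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

def implied_positive_fraction(sensitivity, specificity, accuracy):
    """Solve accuracy = p * sensitivity + (1 - p) * specificity for p."""
    return (specificity - accuracy) / (specificity - sensitivity)

# Invented counts for illustration only.
sens, spec, acc = classification_metrics(tp=90, fp=50, tn=950, fn=10)

# Applied to the Logit row of the table (values as fractions).
p_logit = implied_positive_fraction(0.8285, 0.9930, 0.9778)
```

Under that assumption, the Logit row is consistent with positives making up roughly 9% of the evaluated sequences.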
Specificity of the individual machine learning methods and their unions/combinations when searching results against the Uniprot/Swissprot non-conotoxin database and mean sensitivity for recovering all known conotoxin superfamilies. Methods are ordered as follows: Overlap between different methods, single methods, and union of methods. Methods with the highest sensitivities (≥99.7%) and specificities (≥99.9%) are shown in bold. Abbreviations used: blast—B; logit—L; label spreading—S; perceptron—P.
| Methods | Mean Sensitivity | Specificity |
|---|---|---|
| Overlap.4methods | 34.19% ± 0.32% | |
| Overlap.LSP | 41.53% ± 0.35% | |
| Overlap.LSB | 35.15% ± 0.32% | 99.87% |
| Overlap.LPB | 76.25% ± 0.37% | 99.57% |
| Overlap.SPB | 34.22% ± 0.32% | |
| Overlap.LS | 43.18% ± 0.34% | 99.85% |
| Overlap.LP | 83.61% ± 0.32% | 99.53% |
| Overlap.SP | 41.68% ± 0.35% | |
| Overlap.LB | 79.57% ± 0.35% | 98.32% |
| Overlap.SB | 35.52% ± 0.32% | 99.86% |
| Overlap.PB | 78.02% ± 0.36% | 99.57% |
| Logit | 87.61% ± 0.26% | 99.83% |
| Labelspreading (SemiS) | 43.67% ± 0.34% | 99.49% |
| Perceptron (NeuroNetWork) | 85.96% ± 0.29% | 94.02% |
| Blastp | 87.10% ± 0.28% | 98.19% |
| Union.4methods | | 93.89% |
| Union.LSP | 90.31% ± 0.22% | 98.15% |
| Union.LSB | 95.25% ± 0.12% | 93.89% |
| Union.LPB | | 93.90% |
| Union.SPB | | 93.97% |
| Union.LS | 88.10% ± 0.26% | 98.17% |
| Union.LP | 89.96% ± 0.23% | 98.17% |
| Union.SP | 87.95% ± 0.25% | 99.41% |
| Union.LB | 95.14% ± 0.12% | 93.90% |
| Union.SB | 95.24% ± 0.12% | 93.99% |
| Union.PB | 95.05% ± 0.15% | 93.98% |
Conotoxin candidates expressed in 12 samples from 10 Conus species identified by blastp and machine learning (ML).
| No. of Conotoxins Identified by Blastp against Uniprot/ConoServer Database | No. of Conotoxins Identified by Blastp against Uniprot/ConoServer Database Also Retrieved Using ML | No. of Conotoxins Identified by ML and Subsequently Identified as Conotoxins by Blastp against NCBI | No. of Conotoxin Candidates Identified by ML Only |
|---|---|---|---|
| 49 | 29 | 13 | 984 |
| 34 | 25 | 10 | 1069 |
| 21 | 13 | 9 | 522 |
| 16 | 10 | 3 | 61 |
| 95 | 58 | 19 | 532 |
| 62 | 38 | 15 | 529 |
| 29 | 24 | 17 | 61 |
| 54 | 31 | 23 | 739 |
| 37 | 22 | 6 | 44 |
| 67 | 46 | 21 | 389 |
| 94 | 59 | 38 | 97 |
| 17 | 13 | 24 | 121 |
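As a quick consistency check, the per-sample counts in the table above can be totalled column by column. The sketch below copies the twelve rows in table order; the variable names are illustrative. The "ML only" column sums to 5148, the number of new putative conotoxin transcripts reported in the abstract.

```python
# Each row: (blastp hits, blastp hits also retrieved by ML,
#            ML hits confirmed by blastp against NCBI, ML-only candidates).
rows = [
    (49, 29, 13, 984),
    (34, 25, 10, 1069),
    (21, 13, 9, 522),
    (16, 10, 3, 61),
    (95, 58, 19, 532),
    (62, 38, 15, 529),
    (29, 24, 17, 61),
    (54, 31, 23, 739),
    (37, 22, 6, 44),
    (67, 46, 21, 389),
    (94, 59, 38, 97),
    (17, 13, 24, 121),
]

# Transpose the rows so that each column can be summed.
totals = [sum(column) for column in zip(*rows)]
blastp_total, recovered_by_ml, ml_confirmed_ncbi, ml_only_total = totals

# Fraction of blastp-identified conotoxins that the ML models also retrieved.
recovery_rate = recovered_by_ml / blastp_total

print(totals)  # [575, 368, 198, 5148]
print(f"{recovery_rate:.1%}")
```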
Figure 3. Comparative alignments of selected sequences identified by at least three out of four methods that are (A) likely or (B) not likely to represent genuine novel conotoxins. Cysteines are highlighted in yellow; signal sequences, pro- and postpeptides, and predicted mature toxins are underlined in purple, green, and blue, respectively, as shown at the top of panel A. Sequence labels (contigs) correspond to those provided in Supplementary File 2.
RNAseq data sets from 12 samples of 10 Conus species used in the discovery pipeline.
| Number of Reads (Illumina HiSeq 2000) | Read Length (nt) (Illumina HiSeq 2000) | SRA Accession Number |
|---|---|---|
| 85,877,500 | 101 | SRX5015024 |
| 101,170,402 | 101 | SRX5015022 |
| 53,901,510 | 125 | SRX5015020 |
| 50,652,396 | 101 | SRX1323884 |
| 63,365,620 | 125 | SRX5015023 |
| 28,783,428 | 125 | SRX2779517 |
| 30,784,548 | 125 | SRX1323891 |
| 30,038,902 | 125 | SRX5015021 |
| 31,056,732 | 125 | SRX1323883 |
| 31,180,460 | 125 | SRX5015025 |
| 27,927,952 | 125 | SRX1323894 |
| 19,556,244 | 125 | SRX1323887 |