Guy Naamati, Manor Askenazi, Michal Linial.
Abstract
Toxins are detected in sporadic species along the evolutionary tree of the animal kingdom. Venomous animals include scorpions, snakes, bees, wasps, frogs and numerous marine animals such as the stonefish, snail, jellyfish, hydra and more. Interestingly, proteins that share a common scaffold with animal toxins also exist in non-venomous species. However, due to their short length and primary sequence diversity, these toxin-like proteins remain undetected by classical search engines and genome annotation tools. We construct a toxin classification machine and web server called ClanTox (Classifier of Animal Toxins) that is based on the extraction of sequence-driven features from the primary protein sequence, followed by the application of a classification system trained on known animal toxins. For a given input list of sequences, from venomous or non-venomous settings, the ClanTox system predicts whether each sequence is toxin-like. ClanTox provides a list of positively predicted candidates ranked by statistical confidence. For each protein, additional information is presented, including the presence of a signal peptide, the number of cysteine residues and the associated functional annotations. ClanTox is a discovery-prediction tool for a relatively overlooked niche of toxin-like cell modulators, many of which are therapeutic agent candidates. The ClanTox web server is freely accessible at http://www.clantox.cs.huji.ac.il.
Year: 2009 PMID: 19429697 PMCID: PMC2703885 DOI: 10.1093/nar/gkp299
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
ClanTox predictions of Elapids and Vipers short proteins
| Taxonomy toxin family | P3 | P2 | P1 | N | Total |
|---|---|---|---|---|---|
| Elapids | 291 | 147 | 260 | 99 | 797 |
| Cardiotoxin | 4 | 1 | 0 | 45 | |
| Cytotoxin | 0 | 1 | 0 | 45 | |
| PLA2 | 2 | 20 | 1 | 137 | |
| Vipers | 79 | 142 | 47 | 86 | 354 |
| Disintegrin | 14 | 3 | 0 | 49 | |
| Ammodytin | 24 | 15 | 0 | 110 | |
| PLA2 | 14 | 13 | 0 | 63 | |
The dominating prediction level for each toxin family type is marked in bold. PLA2, phospholipase A2.
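As a quick consistency check on the two class-level rows of the table (Elapids and Vipers), the sketch below sums the four prediction levels, compares the result with the reported total, and derives the fraction of sequences predicted positive (P1 to P3). The counts are copied from the table; the names `rows` and `summarize` are illustrative and not part of ClanTox.

```python
# Class-level rows copied from the table: (P3, P2, P1, N, Total).
rows = {
    "Elapids": (291, 147, 260, 99, 797),
    "Vipers": (79, 142, 47, 86, 354),
}


def summarize(counts):
    """Return (row sum equals reported total, fraction predicted positive)."""
    p3, p2, p1, n, total = counts
    positives = p3 + p2 + p1
    return positives + n == total, positives / total


summary = {name: summarize(counts) for name, counts in rows.items()}
for name, (consistent, fraction) in summary.items():
    print(f"{name}: totals consistent={consistent}, positive fraction={fraction:.3f}")
```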
Figure 1. Score distribution for the non-redundant set of all ∼30 000 SwissProt proteins shorter than 150 amino acids. Proteins used in the classifier's positive training set were excluded. Positive, secure predictions (P2–P3) are assigned to the tail of the distribution, with a mean score >0.2. Intermediate scores, ranging between –0.2 and 0.2, are considered probable/possible toxin-like. The more refined positive prediction confidence levels (P3, P2 and P1) are defined according to the scale of the SD relative to the mean score. The negative prediction (N) is associated with a mean score <–0.2 and covers the large Gaussian-like distribution that contains the vast majority of the proteins.
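The coarse score bands described in the Figure 1 caption can be sketched as a simple threshold rule. This is a minimal illustration, not the ClanTox implementation: the function name `coarse_band` and the band labels are hypothetical, treating scores of exactly ±0.2 as intermediate is an assumption, and the SD-based refinement into P3, P2 and P1 is left out because the caption does not give its exact rule.

```python
def coarse_band(mean_score, threshold=0.2):
    """Map a mean classifier score to the coarse bands named in the caption.

    Assumption: scores exactly at +threshold or -threshold count as intermediate.
    The SD-based split into P3/P2/P1 is intentionally not modelled here.
    """
    if mean_score > threshold:
        return "positive"      # tail of the distribution (mean score > 0.2)
    if mean_score < -threshold:
        return "negative"      # bulk of the distribution (mean score < -0.2)
    return "intermediate"      # probable/possible toxin-like


for score in (0.65, 0.05, -0.4):
    print(score, coarse_band(score))
```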
Figure 2. Screenshots of the ClanTox result page. (A) A pie chart displaying the distribution of prediction classes for 526 short sequences (20–160 amino acids) retrieved from the Drosophila proteomes after filtering out all sequences that contain fragments. The positive predictions (P1–P3) constitute 5.9% of the sequences and are marked in shades from red (P3) to light pink (P1). (B) A detailed table of toxin-like protein predictions sorted by the number of cysteines (Cys). A few examples labeled as P3 (red), P2 (light red) and N (gray) are shown. Together with the ID column, a graphical scheme represents each protein sequence (not to scale, light blue bar) marked with its cysteine residue distribution (red vertical bars). In addition, the table shows the protein name, UniProtKB accession, number of cysteines, sequence length, mean score and standard deviation (SD). Several links to external tools are presented for each predicted protein. (C) Histogram of mean scores for the 526 Drosophila sequences used as input. The positive predictions are color coded as in A. Most sequences were predicted negative and are shown in gray. (D) A cysteine-centric view in which the sequences are shown by their relative length. Cysteines are marked as red vertical lines.
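The length filter and cysteine-centric columns described in the Figure 2 caption can be approximated in a few lines. This is a rough sketch under stated assumptions: the function `cysteine_table` and the toy sequences are hypothetical and not taken from ClanTox or the Drosophila set, and sorting in descending order is an assumption, since the caption only says the table is sorted by the number of cysteines.

```python
def cysteine_table(sequences, min_len=20, max_len=160):
    """Keep sequences of 20-160 residues and list their cysteine statistics.

    `sequences` maps an identifier to a one-letter amino acid string.
    Rows are sorted by cysteine count, highest first (an assumed order).
    """
    rows = []
    for seq_id, seq in sequences.items():
        seq = seq.upper()
        if min_len <= len(seq) <= max_len:
            rows.append({
                "id": seq_id,
                "length": len(seq),
                "cys": seq.count("C"),
                # 1-based positions, as used to draw the red vertical bars
                "cys_positions": [i + 1 for i, aa in enumerate(seq) if aa == "C"],
            })
    rows.sort(key=lambda row: row["cys"], reverse=True)
    return rows


# Toy input (hypothetical sequences built by repetition, not real proteins).
toy = {
    "toyB": "MKT" * 10,    # 30 residues, no cysteines
    "toyA": "ACDE" * 6,    # 24 residues, 6 cysteines
    "short": "CCC",        # below the length window, dropped
    "long": "GC" * 100,    # above the length window, dropped
}
table = cysteine_table(toy)
for row in table:
    print(row["id"], row["length"], row["cys"], row["cys_positions"])
```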