
ConoDictor: a tool for prediction of conopeptide superfamilies.

Dominique Koua, Age Brauer, Silja Laht, Lauris Kaplinski, Philippe Favreau, Maido Remm, Frédérique Lisacek, Reto Stöcklin.

Abstract

ConoDictor is a tool that enables fast and accurate classification of conopeptides into superfamilies based on their amino acid sequence. ConoDictor combines predictions from two complementary approaches: profile hidden Markov models and generalized profiles. Results appear in a browser as tables that can be downloaded in various formats. This application is particularly valuable in view of the exponentially increasing number of conopeptides that are being identified. ConoDictor was written in Perl using the common gateway interface (CGI) module, with a PHP submission page. Sequence matching is performed with hmmsearch from HMMER 3 and ps_scan.pl from the pftools 2.3 package. ConoDictor is freely accessible at http://conco.ebc.ee.

Substances: Conotoxins

Year: 2012 PMID: 22661581 PMCID: PMC3394318 DOI: 10.1093/nar/gks337

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Conopeptides are the main bioactive components of cone snail venom. These marine animals produce complex venoms that contain hundreds of peptides and proteins. Recently, conopeptides have attracted a great deal of interest as a result of their selectivity for, and potent effects on, ion channels and receptors (1,2). Most are cysteine-knotted peptides that have been classified into superfamilies and families based on their structural or functional features (3,4). To date, >1500 non-redundant conopeptide sequences are stored in public databases and this number is increasing exponentially. Conopeptides are classified into ‘gene superfamilies’ based on their signal sequence. Currently, there are 16 major superfamilies, namely: A, D, I1, I2, I3, J, L, M, O1, O2, O3, P, S, T, V and Y. The precursors generally contain an N-terminal signal sequence, a central propeptide region and a C-terminal hypervariable mature toxin (4,5). The ConoServer database (http://www.conoserver.org/) is a repository of nucleic acid and protein sequences, and of structural information on conopeptides (4). The naming and classification of new conopeptide protein sequences has become an important issue because of the sharp increase in the number of new conopeptides being identified, and because studies to determine a peptide’s functional characteristics are based on this classification. The ConoServer prosequence analyzer (ConoPrec) is the most specific web tool available for elucidation of conopeptide class. It provides hints based on the signal peptide sequence of the submitted precursor (6). However, this tool does not work when the signal sequence is missing, which is often the case for conopeptides identified as mature bioactive peptides in venom or dissected venom gland by proteomic and mass spectrometry studies.
As data generated by the increasingly widespread high-throughput venom omics approaches are notoriously incomplete, the classification of new sequences into conotoxin superfamilies should not be restricted to the signal peptide sequence. There is indisputable evidence for the relevance of consensus sequences of propeptides and cysteine frameworks in conopeptide sequences. Thus, the inclusion of these criteria should also be considered for conopeptide classification. We recently demonstrated the reliability of conopeptide family prediction and classification based on profile hidden Markov models (pHMMs) of propeptides and mature peptides (7). ConoDictor has been developed in the context of the CONCO project (www.conco.eu) and is a web-based tool that exploits pHMMs and position-specific scoring matrices (PSSMs, also known as generalized profiles) to classify conopeptides into superfamilies based on their amino acid sequence. ConoDictor is a user-friendly tool that meets users’ demands for an easy-to-use environment for sequence classification and superfamily prediction. As a fully automated tool, ConoDictor provides classification results that must be checked by users.

MATERIALS AND METHODS

Preparation of the model data set

Sequences used for generating the models were obtained from ConoServer. Only precursor sequences with gene superfamily annotation were considered. The training set consisted of 933 sequences. Each sequence was manually annotated with the gene superfamily classification after checking the classification provided by ConoServer. Each sequence was divided into three parts, which were stored separately: signal, propeptide and mature peptide. Separate files were also created for each of the 16 superfamilies, giving 16 × 3 = 48 sequence sets. The sequences were then aligned using MAFFT version 6.707b. The alignments were manually refined when necessary using Jalview 2.5, and the resulting 48 alignments were used to build the models.
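The partitioning step described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Precursor` record, the boundary indices, the `X_region` naming and the toy sequence are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple

# The 16 superfamilies listed in the Introduction.
SUPERFAMILIES = ["A", "D", "I1", "I2", "I3", "J", "L", "M",
                 "O1", "O2", "O3", "P", "S", "T", "V", "Y"]
REGIONS = ("sig", "pro", "mat")

# One sequence set (and later one alignment) per superfamily and region.
MODEL_NAMES = [f"{sf}_{region}" for sf in SUPERFAMILIES for region in REGIONS]


class Precursor(NamedTuple):
    """Hypothetical annotated precursor; field names are assumptions."""
    name: str
    superfamily: str
    sequence: str
    signal_end: int      # index where the signal peptide ends (exclusive)
    propeptide_end: int  # index where the propeptide ends (exclusive)


def split_regions(p: Precursor) -> dict:
    """Cut a precursor into signal, propeptide and mature peptide."""
    return {
        "sig": p.sequence[:p.signal_end],
        "pro": p.sequence[p.signal_end:p.propeptide_end],
        "mat": p.sequence[p.propeptide_end:],
    }


def group_by_model(precursors) -> dict:
    """Collect region fragments under keys such as 'A_sig' or 'O1_mat'."""
    groups = defaultdict(list)
    for p in precursors:
        for region, fragment in split_regions(p).items():
            if fragment:  # skip regions that are absent from this record
                groups[f"{p.superfamily}_{region}"].append((p.name, fragment))
    return dict(groups)
```

Each of the 48 groups would then be written to its own file and aligned separately.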

Hidden Markov models

We previously described the ability of pHMMs to classify conopeptides (7). We constructed a pHMM for each of the 48 alignments using the hmmbuild program from the HMMER 3.0 package (8,9). Matches between pHMMs and the sequence data set were searched using the hmmsearch program with an E-value significance level set to 0.1.
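As a rough illustration of this step, the sketch below assembles HMMER 3 command lines and reads the tabular output. The file names are hypothetical, and using `-E` together with `--tblout` is an assumption about how the 0.1 cut-off might be applied; the text does not state the exact invocation used by the server.

```python
def hmmbuild_command(alignment_path: str, hmm_path: str) -> list:
    """Command that builds one pHMM from one alignment.

    HMMER 3 usage: hmmbuild [options] <hmmfile_out> <msafile>
    """
    return ["hmmbuild", hmm_path, alignment_path]


def hmmsearch_command(hmm_path: str, fasta_path: str, tblout_path: str,
                      evalue: float = 0.1) -> list:
    """Command that searches sequences with one pHMM.

    -E limits reported sequences to those at or below the E-value threshold;
    --tblout writes a whitespace-delimited per-target table.
    """
    return ["hmmsearch", "-E", str(evalue), "--tblout", tblout_path,
            hmm_path, fasta_path]


def parse_tblout(lines) -> dict:
    """Return {target name: best full-sequence E-value} from --tblout lines.

    In the tblout layout, column 1 is the target name and column 5 is the
    full-sequence E-value; lines starting with '#' are comments.
    """
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        target, evalue = fields[0], float(fields[4])
        if target not in best or evalue < best[target]:
            best[target] = evalue
    return best
```

The command lists can be passed to `subprocess.run` on a machine where HMMER 3 is installed; they are returned rather than executed here so that the sketch has no external dependency.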

Generalized profiles (PSSM)

Generalized profiles were constructed using the pftools package, version 2.3. The most recent methodology based on annotated multiple sequence alignment (AMSA) was used. The generalized profiles were generated using apsimake in a semi-global mode after weighting of the alignments. The resulting models were calibrated against randomized sequences and cut-off values were tuned manually. These approaches have already been validated for classification of other proteins (10,11).

Testing of models on known conopeptides

The test set was constructed from publicly available conopeptide sequences extracted from the NCBI Protein database and UniProtKB (release 2010_11). The test set contained 1225 manually curated sequences. Sequences were manually annotated and assigned to the relevant superfamily according to UniProtKB annotations, cysteine frameworks and sequence similarity. Sequences not belonging to any superfamily were added to the test set as negative controls.

ConoDictor implementation

Input sequences are first classified using pHMMs and PSSMs separately. For a superfamily X, the pHMM models of the signal (X_sig), propeptide (X_pro) and mature peptide (X_mat) are used in parallel and the corresponding predictions are combined. The same process is applied with the PSSM models. The resulting pHMM and PSSM classifications are then merged to produce a global combined classification.

(i) For pHMM-based classification, we adopted the product of E-values as the final score:

$$\mathrm{Score}_{\mathrm{pHMM}}(X) = E(X_{\mathrm{sig}}) \times E(X_{\mathrm{pro}}) \times E(X_{\mathrm{mat}})$$

where $E(X_r)$ is the E-value of the match between the sequence and the model $X_r$, and each factor is included provided that the corresponding E-value exists. A sequence was predicted to belong to the superfamily with the smallest pHMM score when this score was at least one hundred times lower than that of any other superfamily. If the difference was smaller, the sequence was predicted as ‘CONFLICT’ for the pHMM. When no score was generated for any superfamily, the sequence was tagged ‘UNKNOWN’.

(ii) For generalized profile predictions, it is not possible to compare and merge scores obtained from separate profiles. The PSSM prediction score for a sequence is therefore the number of models of one superfamily (1–3) that match the sequence:

$$\mathrm{Score}_{\mathrm{PSSM}}(X) = \sum_{r \in \{\mathrm{sig},\,\mathrm{pro},\,\mathrm{mat}\}} \mathrm{HasMatch}(\mathrm{sequence}, X_r)$$

where the boolean function HasMatch(sequence, model) returns 1 if the sequence matched the considered model, or 0 otherwise. The sequence is predicted to belong to the superfamily with the highest score. If two or more superfamilies have the same score, the sequence is tagged as ‘CONFLICT’, and the list of conflicting families is returned. When no match is reported for a given sequence, the sequence is tagged ‘UNKNOWN’.

Match lists of pHMMs and PSSMs are merged, and each prediction is weighted according to its frequency. The combined prediction is the superfamily with the highest frequency. When the highest frequency is linked to more than one superfamily, the sequence is tagged ‘CONFLICT’. When no match is reported for either method, the sequence is tagged ‘UNKNOWN’.
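The two per-method decision rules can be sketched in Python as follows. This is an illustrative reading of the description above, not the ConoDictor source (which is written in Perl). The data layout, the function names and the treatment of exact ties are assumptions.

```python
from math import prod

REGIONS = ("sig", "pro", "mat")


def phmm_score(evalues: dict):
    """Product of the region E-values that exist; None if no model matched."""
    present = [evalues[r] for r in REGIONS if evalues.get(r) is not None]
    return prod(present) if present else None


def phmm_predict(per_superfamily: dict, ratio: float = 100.0) -> str:
    """per_superfamily maps a superfamily to {'sig': E, 'pro': E, 'mat': E}."""
    scores = {sf: phmm_score(ev) for sf, ev in per_superfamily.items()}
    scores = {sf: s for sf, s in scores.items() if s is not None}
    if not scores:
        return "UNKNOWN"
    ranked = sorted(scores.items(), key=lambda item: item[1])
    if len(ranked) == 1:
        return ranked[0][0]
    (best_sf, best), (_, runner_up) = ranked[0], ranked[1]
    # The best score must be at least `ratio` times lower than the next one.
    # Treating an exact tie as a conflict is an assumption of this sketch.
    if best < runner_up and best * ratio <= runner_up:
        return best_sf
    return "CONFLICT"


def pssm_score(matches: dict) -> int:
    """Number of region profiles (0 to 3) of one superfamily that matched."""
    return sum(1 for r in REGIONS if matches.get(r))


def pssm_predict(per_superfamily: dict):
    """per_superfamily maps a superfamily to {'sig': bool, 'pro': bool, 'mat': bool}.

    Returns (label, superfamilies sharing the top score).
    """
    scores = {sf: pssm_score(m) for sf, m in per_superfamily.items()}
    top = max(scores.values(), default=0)
    if top == 0:
        return "UNKNOWN", []
    winners = sorted(sf for sf, s in scores.items() if s == top)
    if len(winners) > 1:
        return "CONFLICT", winners
    return winners[0], winners
```

For example, a sequence whose signal and propeptide match the A models with E-values of 1e-10 and 1e-5, and whose mature peptide matches the M model at 1e-3, gets pHMM scores of 1e-15 (A) and 1e-3 (M) and is assigned to A.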
Although pHMMs and PSSMs are very robust classification approaches, the small size of the training set for some families and/or the underlying scoring system can explain rare cases of misclassification. The ‘CONFLICT’ and ‘UNKNOWN’ tags can therefore indicate families that have not been modelled (possibly new ones) or divergent sequences from an existing family. In any case, all classifications have to be validated by users before being used for further studies.
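The merging step of the combined classification described in this section can be sketched as a simple frequency count. The text does not spell out how frequencies are computed, so this sketch assumes that every matching model contributes one vote for its superfamily; the function name and input layout are hypothetical.

```python
from collections import Counter


def combined_predict(phmm_matches, pssm_matches):
    """Merge the match lists of both methods and pick the most frequent label.

    Each argument lists the superfamily of every matching model, with
    repeats allowed (e.g. ['A', 'A', 'M'] for A_sig, A_pro and M_mat).
    Returns (label, superfamilies sharing the highest frequency).
    """
    counts = Counter(phmm_matches) + Counter(pssm_matches)
    if not counts:
        return "UNKNOWN", []          # neither method reported a match
    top = max(counts.values())
    winners = sorted(sf for sf, n in counts.items() if n == top)
    if len(winners) > 1:
        return "CONFLICT", winners    # highest frequency is shared
    return winners[0], winners
```

With this reading, a ‘CONFLICT’ result carries the list of tied superfamilies, which gives the user a starting point for the manual validation recommended above.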

RESULTS

Conopeptide models

For each of the 16 known conotoxin superfamilies, three separate models based on signal, pro- and mature peptides were built, providing a total of 48 hidden Markov models and 48 generalized profiles. The models were named according to the superfamily and the region of the precursor that they targeted. Each model demonstrated very good discriminative abilities, with high sensitivity (∼95%) and selectivity (∼99%) [(7) and Koua et al., unpublished data]. When tested using known conopeptide sequences, these models enabled extensive and reliable classification, even between superfamilies containing mature peptides with high sequence similarities. The models provided good evidence of complementarity between signal, pro- and mature peptide sequences for superfamily determination, as well as complementarity between pHMMs and generalized profiles (Koua et al., submitted).

ConoDictor input interface

ConoDictor accepts amino acid sequences in FASTA format as input. The sequences can be pasted into the prepared field or uploaded as a file from the user’s computer (Figure 1). Sequences can be annotated with a predicted superfamily in the header, placed between hash signs (#); otherwise they are considered as ‘UNKNOWN’. By default, the models built in the framework of the CONCO project are used to analyse the input sequences. However, users can also upload their own PSSMs and/or pHMMs. An annotated testing set (attached to a ‘LOAD TEST DATA’ button) is also available from the input interface.
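A minimal Python sketch of reading such input is given below. The exact header layout expected by the server is not specified here, so the pattern `#X#` anywhere in the header line is an assumption, as are the function names and the toy records.

```python
import re

# Assumed annotation pattern: a superfamily label enclosed by two '#' signs.
ANNOTATION = re.compile(r"#([^#\s]+)#")


def _make_record(header: str, chunks: list) -> dict:
    match = ANNOTATION.search(header)
    words = header.split()
    return {
        "id": words[0] if words else "",
        # Sequences without an annotation are treated as 'UNKNOWN'.
        "annotated_superfamily": match.group(1) if match else "UNKNOWN",
        "sequence": "".join(chunks),
    }


def read_fasta(text: str) -> list:
    """Parse FASTA text into records with an optional superfamily annotation."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if header is not None:
        records.append(_make_record(header, chunks))
    return records
```

Keeping the user-supplied annotation next to each sequence makes it straightforward to compare it with the predicted superfamily afterwards.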

Figure 1.

ConoDictor input (background) and output (foreground) interfaces. The input interface provides a text area for amino acid sequence in FASTA format and areas for users to upload their own models. A test set is also provided and can be loaded via a simple click. The output interface provides detailed, self-explanatory tables grouped by analysis type. The combined prediction/classification is summarized under the ‘General result’ tab.

Visualization interface

The ConoDictor output interface offers user-friendly tab views of matching outputs and predictions (Figure 1). The main tab provides the combined prediction, as well as a summary of the pHMM- and PSSM-based predictions. Detailed result tabs for pHMM- and PSSM-based predictions provide the number of sequence matches for each model, the position of each sequence/model match, and the related E-value and score of each individual model match. Tab headers and table column names explain the results displayed. The result page is refreshed automatically until the analysis results are available. An Excel file (.xls) and raw text versions (.txt) of all results can be downloaded. A session identifier is also provided, and the results can be accessed and visualized on the server for up to 3 weeks after submission or last viewing. A detailed help page provides clear explanations and screenshots of the most important tables of the analysis (http://conco.ebc.ee/ConoDictor_help.html).

CONCLUSION

ConoDictor is a web-based application built on preliminary studies which established that PSSMs and pHMMs complement each other for conopeptide identification and classification. Thanks to a user-friendly interface, ConoDictor provides an easy-to-use environment for classification of conopeptides into superfamilies based on their amino acid sequence. In view of the rapidly increasing number of new conopeptides being discovered by next-generation transcriptomic platforms, ConoDictor is a valuable bioinformatics tool for their classification and serves as a starting point for investigation of their functional characteristics.

FUNDING

CONCO project (www.conco.eu) [LSHB-CT-2007-037592, in part], funded by the EU 6th Framework Programme (LIFESCIHEALTH); the European Regional Development Fund (Estonian Center of Excellence in Genomics); Atheris Laboratories. Funding for open access charge: LIFESCIHEALTH [LSHB-CT-2007-037592].

Conflict of interest statement. None declared.

9 in total

Review 1. Conotoxins, in retrospect.

Authors: B M Olivera; L J Cruz
Journal: Toxicon Date: 2001-01 Impact factor: 3.033

2. Identification and classification of conopeptides using profile Hidden Markov Models.

Authors: Silja Laht; Dominique Koua; Lauris Kaplinski; Frédérique Lisacek; Reto Stöcklin; Maido Remm
Journal: Biochim Biophys Acta Date: 2011-12-30

Review 3. Conotoxins down under.

Authors: Raymond S Norton; Baldomero M Olivera
Journal: Toxicon Date: 2006-07-15 Impact factor: 3.033

Review 4. PeroxiBase: a powerful tool to collect and analyse peroxidase sequences from Viridiplantae.

Authors: Michele Oliva; Grégory Theiler; Marcel Zamocky; Dominique Koua; Marcia Margis-Pinheiro; Filippo Passardi; Christophe Dunand
Journal: J Exp Bot Date: 2008-12-26 Impact factor: 6.992

5. Conopeptide characterization and classifications: an analysis using ConoServer.

Authors: Quentin Kaas; Jan-Christoph Westermann; David J Craik
Journal: Toxicon Date: 2010-03-06 Impact factor: 3.033

Review 6. Conotoxins - new vistas for peptide therapeutics.

Authors: R M Jones; G Bulaj
Journal: Curr Pharm Des Date: 2000-08 Impact factor: 3.116

7. Hidden Markov model speed heuristic and iterative HMM search procedure.

Authors: L Steven Johnson; Sean R Eddy; Elon Portugaly
Journal: BMC Bioinformatics Date: 2010-08-18 Impact factor: 3.169

8. ConoServer: updated content, knowledge, and discovery tools in the conopeptide database.

Authors: Quentin Kaas; Rilei Yu; Ai-Hua Jin; Sébastien Dutertre; David J Craik
Journal: Nucleic Acids Res Date: 2011-11-03 Impact factor: 16.971

9. PeroxiBase: a database with new tools for peroxidase family classification.

Authors: Dominique Koua; Lorenzo Cerutti; Laurent Falquet; Christian J A Sigrist; Grégory Theiler; Nicolas Hulo; Christophe Dunand
Journal: Nucleic Acids Res Date: 2008-10-23 Impact factor: 16.971
