
PKSIIIexplorer: TSVM approach for predicting Type III polyketide synthase proteins.

Mallika Vijayan, Sivakumar Krishnankutty Chandrika, Soniya Eppurathu Vasudevan.

Abstract

PKSIIIexplorer, a web server based on the 'transductive Support Vector Machine', allows fast and reliable prediction of type III polyketide synthase proteins. It provides a simple platform for estimating, with moderately high accuracy, the probability that a particular sequence is a type III polyketide synthase. We hope that our method will serve as a useful program for type III polyketide synthase researchers. The tool is available at "http://type3pks.in/tsvm/pks3". ABBREVIATIONS: PKS - Polyketide synthase, CHS - Chalcone synthase, SVM - Support vector machine, MCC - Matthews Correlation Coefficient.


Keywords: Chalcone synthase; PKSIIIexplorer; TSVM; Type III polyketide synthase

Year: 2011 PMID: 21584189 PMCID: PMC3089887 DOI: 10.6026/97320630006125

Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063

Background

Type III polyketide synthases (type III PKSs) are a large superfamily of proteins that produce a wide variety of secondary metabolites with antibiotic, antifungal, antitumor and immunosuppressive activities [1]. For example, resveratrol, a stilbene synthase derivative from grapes, shows cancer chemopreventive activity in murine models [2]. To discover more of these novel proteins, Support Vector Machines (SVMs) have been used successfully for classification. We previously developed the SVM-based “PKSIIIpred”, in which only labeled data were used in the training set [3]. In this improved version, we used a variant of SVM, the so-called ‘Transductive SVM’ (TSVM), which takes into account not only the labeled training data but also unlabeled data.

Methodology

Dataset

Positive (type III PKSs) and negative (non-type III PKSs) datasets were prepared (1000 each). Sequences were retrieved in FASTA format from Swiss-Prot. The unlabeled dataset (2000) was generated by using profile hidden Markov models (HMMs), built from the positive dataset, to extract candidate proteins from Swiss-Prot. For the sequences in the unlabeled dataset, it is not known whether they are type III PKSs or not. BLASTCLUST was used to verify the non-redundancy of the datasets [4].

TSVM implementation

SVMs are a group of fast, optimization-based machine learning algorithms that have been used for many kinds of pattern recognition [5]. The performance of the SVM-based method was optimized by tuning the SVM parameters, including the kernel function (linear, polynomial, radial or sigmoid). In classical SVMs, the training data used to build the model ideally cover the whole problem space; the model is then used to predict the labels of new data points. In most biological datasets, however, the number of labeled data points is rather small, while a large number of unlabeled data points are available. To take advantage of these unlabeled data, the so-called ‘TSVMs’ have been developed [6]. Here, the TSVM was implemented using the SVMlight package, which possesses two modules: SVM_learn (preparing models) and SVM_classify (classifying samples). For each cluster of composite specificity, we prepared a feature file with the sequences belonging to this specificity labeled +1, all other sequences with different but known specificity labeled −1, and the uncharacterized sequences labeled 0. The TSVM was trained on this file, as described above, to obtain a model for that composite specificity. During several rounds of evaluation, many parameter sets produced poorly performing models with low MCC values. We therefore selected a set of consistently performing parameters to identify the best-performing models. After training the SVM models, it is necessary to combine the predictions of all models into a single prediction. Here, the SVM that outputs the largest score is used to assign the specificity to the unknown sequence.
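The labeling scheme and the largest-score rule described above can be sketched as follows. This is only an illustration, not the server's actual code: the function names are hypothetical, and the output is assumed to follow the usual SVMlight input layout of a target value followed by ascending `index:value` pairs, with target 0 marking unlabeled examples for transductive training.

```python
def svmlight_line(label, features):
    """Format one example as an SVMlight-style line.

    label: +1 (belongs to the specificity), -1 (other known specificity),
           or 0 (uncharacterized sequence, used transductively).
    features: list of numeric feature values; indices start at 1.
    Zero-valued features are omitted (sparse format).
    """
    if label not in (1, -1, 0):
        raise ValueError("label must be +1, -1 or 0")
    target = "+1" if label == 1 else str(label)
    pairs = " ".join(
        f"{index}:{value:.6g}"
        for index, value in enumerate(features, start=1)
        if value != 0
    )
    return f"{target} {pairs}".rstrip()


def feature_file_text(examples):
    """Join (label, features) pairs into the text of one feature file."""
    return "\n".join(svmlight_line(label, feats) for label, feats in examples) + "\n"


def combine_predictions(scores):
    """Return the name of the model with the largest decision score.

    scores: dict mapping a (hypothetical) model name to its score
            for a single query sequence.
    """
    if not scores:
        raise ValueError("no model scores given")
    return max(scores, key=scores.get)
```

For example, `feature_file_text([(1, [0.5, 0.0, 0.25]), (0, [0.0, 1.0])])` yields two lines, one labeled example and one unlabeled example, and `combine_predictions({"model_a": -0.3, "model_b": 1.2})` selects `"model_b"`.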

Numerical properties

The models were trained using dipeptide and multiplet frequencies [7] of the amino acid composition. For each protein, a matrix of 400 dipeptide frequencies was generated and fed as input to the SVM. The repetitiveness of the amino acid sequences was analyzed by means of multiplets, which comprise homopolymeric stretches of any length (XX, XXX, ... (X)n), where X denotes any specific amino acid and n≥2.
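A minimal sketch of how such features might be computed is given below. The server's own programs were written in Perl; this Python version is only illustrative. The function names are hypothetical, and the normalization used for multiplets (the fraction of residues lying in homopolymeric runs, per amino acid) is an assumption, since the text does not specify one.

```python
from itertools import groupby, product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered dipeptides, in a fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_frequencies(sequence):
    """Return 400 dipeptide frequencies for one protein sequence.

    Overlapping pairs are counted; pairs containing non-standard
    residues are skipped.
    """
    sequence = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


def multiplet_frequencies(sequence):
    """Return, per amino acid, the fraction of residues in runs of length >= 2.

    Assumption: multiplets (XX, XXX, ...) are normalized by sequence length.
    """
    sequence = sequence.upper()
    in_runs = dict.fromkeys(AMINO_ACIDS, 0)
    for residue, run in groupby(sequence):
        length = len(list(run))
        if length >= 2 and residue in in_runs:
            in_runs[residue] += length
    n = len(sequence)
    return [in_runs[a] / n if n else 0.0 for a in AMINO_ACIDS]
```

Concatenating the two lists gives one fixed-length numeric vector per protein, which is the kind of input an SVM expects.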

Webserver

The server runs on Apache version 2.0, and the scripting was done in PHP version 5.3.2. The background programs that compute dipeptide and multiplet frequencies were written in Perl 5.8.5.

Performance assessment

A fivefold cross-validation technique was used to evaluate the performance of all the models. We computed the error rate (err), specificity (SP), sensitivity (SN) and MCC [8-9] to assess the performance of the method (given in Supplementary material). Sensitivity is the fraction of true type III PKSs that are correctly predicted as positive; specificity is the fraction of non-type III PKSs that are correctly predicted as negative; the ‘error rate’ is the fraction of type III PKS data that is classified incorrectly [9]. MCC ranges from −1 to +1, and higher values indicate better prediction. We identified the model with the highest MCC value in each of the five subsets. In the second subset, three models with different parameter sets (47, 65 and 89) were equally good, and therefore all three were included (Table 1).
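For orientation, these measures can be computed from the counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) as sketched below. The exact formulas used by the authors are in the Supplementary material; this sketch uses the standard definitions, and taking the error rate over all evaluated sequences is an assumption. The function name is hypothetical.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute SN, SP, error rate and MCC from confusion-matrix counts."""
    total = tp + tn + fp + fn
    # Sensitivity: fraction of true positives that are recovered.
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of true negatives that are recovered.
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    # Error rate (assumption: over all evaluated sequences).
    err = (fp + fn) / total if total else 0.0
    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "err": err, "MCC": mcc}
```

With 90 true positives, 80 true negatives, 20 false positives and 10 false negatives, this gives SN = 0.9, SP = 0.8, err = 0.15 and an MCC of about 0.70.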

Discussion

The web interface of “PKSIIIexplorer” allows one to ‘upload’ or ‘paste in’ sequences in FASTA format. Here we describe the application of TSVMs to functionally predict the peptides, based on the chemical fingerprint of the residues. Among the kernel functions tested, the polynomial and radial basis function (RBF) kernels gave better results than the linear and sigmoid kernels (Table 1), and the SVM models yielded very good results (MCC = 0.84–0.97). In addition to plant proteins, we also included type III PKSs from bacteria, fungi and bryophytes in the training dataset, so that such sequences can also be predicted well when submitted by users. Notably, the server correctly predicts as negative the ketosynthase domain of type I PKS, which adopts a similar structural fold and shows sequence similarity to type III PKS. These results demonstrate that the sequence features used by PKSIIIexplorer have strong discriminating power. The system was also found to be superior (Figure 1) to the previous prediction server “PKSIIIpred” (http://type3pks.in/prediction/).

Figure 1

Statistical comparison indicates that the TSVM-based PKSIIIexplorer is superior to PKSIIIpred in sensitivity (SN), specificity (SP) and accuracy (AC).

Conclusion

Because of their diverse pharmacological functions, the volume of data on type III PKSs is rapidly increasing. In this regard, developing a highly sensitive method to identify these proteins ‘in silico’ will accelerate experimental research. Our results give highly reliable predictions even though the amount of training data is relatively small, leaving room for further improvement as the number of known type III PKSs grows. BLAST could be helpful, especially for rare specificities, and we therefore plan to integrate it in a future version of PKSIIIexplorer.


1. Assessing the accuracy of prediction algorithms for classification: an overview.

Authors: P Baldi; S Brunak; Y Chauvin; C A Andersen; H Nielsen
Journal: Bioinformatics Date: 2000-05

2. The chalcone synthase superfamily of type III polyketide synthases.

Authors: Michael B Austin; Joseph P Noel
Journal: Nat Prod Rep Date: 2003-02

3. Comparison of the predicted and observed secondary structure of T4 phage lysozyme.

Authors: B W Matthews
Journal: Biochim Biophys Acta Date: 1975-10-20

4. Methods and algorithms for statistical analysis of protein sequences.

Authors: V Brendel; P Bucher; I R Nourbakhsh; B E Blaisdell; S Karlin
Journal: Proc Natl Acad Sci U S A Date: 1992-03-15

5. Basic local alignment search tool.

Authors: S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal: J Mol Biol Date: 1990-10-05

6. Kernel based machine learning algorithm for the efficient prediction of type III polyketide synthase family of proteins.

Authors: V Mallika; K C Sivakumar; S Jaichand; E V Soniya
Journal: J Integr Bioinform Date: 2010-07-13

7. An overview of statistical learning theory.

Authors: V N Vapnik
Journal: IEEE Trans Neural Netw Date: 1999

8. Cancer chemopreventive activity of resveratrol, a natural product derived from grapes.

Authors: M Jang; L Cai; G O Udeani; K V Slowing; C F Thomas; C W Beecher; H H Fong; N R Farnsworth; A D Kinghorn; R G Mehta; R C Moon; J M Pezzuto
Journal: Science Date: 1997-01-10
