CytoSVM: an advanced server for identification of cytokine-receptor interactions.

Jin-Rui Xu, Jing-Xian Zhang, Bu-Cong Han, Liang Liang, Zhi-Liang Ji.

Abstract

The interactions between cytokines and their complementary receptors are the gateway to understanding a large variety of cytokine-specific cellular activities such as immunological responses and cell differentiation. To discover novel cytokine-receptor interactions, an advanced support vector machine (SVM) model, CytoSVM, was constructed in this study. This model was iteratively trained in an enriched way using 449 mammalian (excluding rat) cytokine-receptor interactions and about 1 million virtually generated positive and negative vectors. Final independent evaluation on rat data gave a sensitivity of 97.4%, a specificity of 99.2% and a Matthews correlation coefficient (MCC) of 0.89. This performance is better than that of normal SVM-based models. Based on this well-optimized model, a web-based server was created that accepts a primary protein sequence and reports its probabilities of interacting with one or several cytokines. Moreover, the model was applied to identify putative cytokine-receptor pairs in the whole genomes of human and mouse. Excluding currently known cytokine-receptor interactions, a total of 1609 novel cytokine-receptor pairs were discovered in the human genome with probability approximately 80% after further transmembrane analysis. These cover 220 novel receptors (excluding their isoforms) for 126 human cytokines. The screening results have been deposited in a database. Both the server and the database can be freely accessed at http://bioinf.xmu.edu.cn/software/cytosvm/cytosvm.php.

Year: 2007 PMID: 17526528 PMCID: PMC1933174 DOI: 10.1093/nar/gkm254

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The binding of cytokines to their receptors on cell membranes triggers cellular activities such as immunological regulation, cell growth, differentiation, apoptosis and migration in vertebrates (1). Characterization of novel cytokine-receptor pairs is therefore a shortcut to understanding these cytokine-mediated signal pathways. Traditional isolation and characterization methods for identifying cytokine-receptor pairs are significantly limited by the short half-life, low plasma concentrations, pleiotropy and redundancy of cytokines. The situation has been improved by the application of modern molecular technologies such as cloning. Furthermore, as a complement to experimental approaches, searches for new cytokines or their receptors are now often conducted by identifying genes highly homologous to known cytokine/receptor genes. Currently, 203 human cytokine-receptor pairs have been characterized, as presented in the KEGG pathway database (2). Unfortunately, it has become increasingly difficult to discover new cytokine-receptor partners unless new sequence features are identified. In particular, the functions of peptides without significant sequence similarity to known cytokines/receptors are difficult to probe with homology-based or clustering methods. Various alternative methods for describing protein interactions have been developed in recent years. These include evolutionary analysis (3,4), Hidden Markov Models (5), structural considerations (6–8), protein/gene fusion (9,10), motif recognition (11), family classification by sequence clustering (12) and functional family prediction by statistical learning methods (13,14). The support vector machine (SVM) is a two-class classifier that has previously been used in the classification of cytokine families (http://www.bioinfo.tsinghua.edu.cn/%7Ehn/CTKPred/index.html) (14).
In this study, we constructed an improved SVM model, CytoSVM, for the identification of cytokine-receptor interactions on the basis of protein primary sequences. This model was further applied to screen the whole genomes of human and mouse for novel cytokine-receptor pairs.

CONSTRUCTION OF CytoSVM MODEL

CytoSVM is a model based on the statistical learning algorithm SVM. This algorithm has been well studied and applied to a variety of protein classification problems, including protein functional class (13,15), fold recognition (16), analysis of solvent accessibility (17), prediction of secondary structures (18) and protein–protein interactions (19). Because it uses sequence-derived physicochemical properties of proteins as the basis for classification, SVM may be particularly useful for the functional classification of distantly related proteins and of homologous proteins with different functions (13). This makes SVM a potentially attractive method for probing novel cytokine receptors, especially when the sequence diversity of cytokine receptors cannot be properly handled by sequence homology-based approaches.

The data sets

The positive data pool

The positive data (the true cytokine-receptor interactions) were collected from the KEGG pathway database (2) and the literature. These interaction pairs cover 449 distinct known cytokine-receptor interactions in mammals other than rat. To be eligible for model construction, every sequence was represented by a feature vector assembled from encoded representations of tabulated residue properties, including amino acid composition, hydrophobicity, normalized Van der Waals volume, polarity, polarizability, charge, surface tension, secondary structure and solvent accessibility for each residue in the sequence (13,15–19). A positive vector for an interaction pair was formed by joining the vectors of the cytokine and its complementary receptor. To enlarge the positive data pool, four virtual vectors were generated around each positive vector by slightly (by about 1/1000) increasing or decreasing the values of the vector elements in multi-dimensional space. As a result, a total of 2243 positive data (449 true positives and 1794 virtual positives) were prepared for model training.
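The pairing and enlargement steps described above can be illustrated with a minimal sketch. The function names, the toy four-element vector and the choice of a random per-element direction are assumptions made for illustration; the text only states that element values were increased or decreased by about 1/1000, and "joining" is interpreted here as simple concatenation.

```python
import random


def join_pair(cytokine_vec, receptor_vec):
    """Form one interaction vector (assumed here to be plain concatenation)."""
    return list(cytokine_vec) + list(receptor_vec)


def virtual_positives(vector, n_virtual=4, rel_step=1e-3, seed=0):
    """Return n_virtual slightly perturbed copies of one positive vector.

    Assumption: every element is scaled by (1 + rel_step) or (1 - rel_step),
    with the direction drawn at random for each element.
    """
    rng = random.Random(seed)
    copies = []
    for _ in range(n_virtual):
        copies.append([x * (1.0 + rng.choice((-rel_step, rel_step))) for x in vector])
    return copies


# Hypothetical toy vectors; real feature vectors are much longer.
pair_vector = join_pair([0.25, 0.10], [0.75, 0.40])
pool = [pair_vector] + virtual_positives(pair_vector)
```

Each true positive contributes itself plus four near-identical neighbours, so the enlarged pool stays tightly clustered around the experimentally known interactions.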

The negative data pool

The negative data pool includes both true and virtual data. The true negatives are 126 literature-reported non-cytokine–protein interactions, which represent the sequential and structural features of non-cytokine–receptor interactions only to a very limited extent. To cover all possible negative conditions, a large number of virtual negative interaction pairs were generated as follows: 7816 seed sequences representing diverse domain families, excluding those containing any known cytokine or cytokine receptor, were extracted from the Pfam protein families database (20). These Pfam seeds were paired with mammalian cytokines in all possible combinations to form the virtual negative interactions. The same sequence-to-vector transformations described earlier were applied to these negative interaction pairs. In total, about 1 million negative data were available in the negative data pool.
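The all-combinations pairing described above amounts to a Cartesian product of the cytokine set and the Pfam seed set. The short sketch below shows this; the identifiers are hypothetical placeholders, not entries from the actual data.

```python
from itertools import product


def virtual_negative_pairs(cytokine_ids, pfam_seed_ids):
    """Pair every cytokine with every Pfam seed sequence."""
    return list(product(cytokine_ids, pfam_seed_ids))


# Hypothetical identifiers used only to show the shape of the result.
cytokines = ["cytokine_1", "cytokine_2"]
seeds = ["seed_1", "seed_2", "seed_3"]
pairs = virtual_negative_pairs(cytokines, seeds)
```

The number of virtual negatives is the product of the two set sizes, which is why a few thousand seeds are enough to produce a pool on the order of a million pairs.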

The SVM algorithm

The theory of SVM has been well described in the literature (21,22). The structural and physicochemical features of a protein interaction are represented by a feature vector derived from the primary sequences as described earlier. The vector is projected into a hyperspace in which a hyperplane classifies the protein interaction pair as either positive (cytokine–receptor interaction) or negative (non-cytokine–receptor interaction), depending on which side of the hyperplane the vector lies. In this study, an RBF kernel function K(x_i, x_j) was adopted to map the input vectors into a high-dimensional feature space:

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$$

where γ > 0 sets the kernel width. The output of the SVM model is the class of the input, which is directly associated with a posterior probability by fitting a sigmoid (23):

$$P(y = 1 \mid x) = \frac{1}{1 + \exp(A f(x) + B)}$$

where f(x) is the output of the SVM, and the parameters A and B are estimated by minimizing the negative log likelihood of the training data. A higher probability indicates higher confidence in a positive prediction.
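As a numerical illustration of the two ingredients named above, the sketch below evaluates a Gaussian RBF kernel of the form exp(-gamma * ||x - y||^2) and a Platt-style sigmoid 1 / (1 + exp(A * f + B)). The gamma parametrization and the example values of A and B are assumptions chosen for illustration; they are not the fitted parameters of CytoSVM.

```python
import math


def rbf_kernel(x, y, gamma):
    """Gaussian RBF kernel between two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def platt_probability(f, A, B):
    """Map a raw SVM decision value f to a probability with a fitted sigmoid."""
    return 1.0 / (1.0 + math.exp(A * f + B))


# Hypothetical values; A is negative so that larger f gives higher probability.
same = rbf_kernel([0.2, 0.4], [0.2, 0.4], gamma=0.5)
far = rbf_kernel([0.2, 0.4], [2.2, 3.4], gamma=0.5)
p_boundary = platt_probability(0.0, A=-2.0, B=0.0)
p_positive = platt_probability(1.5, A=-2.0, B=0.0)
```

Identical vectors give a kernel value of 1 and distant vectors a value near 0, while a decision value on the hyperplane (f = 0, B = 0) maps to a probability of 0.5.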

Evaluation and performance measure

As in the case of all discriminative methods (24), the performance of SVM classification can be measured by the numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN); the sensitivity SE = TP/(TP + FN), which is the accuracy of cytokine–protein interaction prediction; and the specificity SP = TN/(TN + FP), which is the accuracy of non-cytokine–protein interaction prediction. The overall performance of the model can be measured using both the Matthews correlation coefficient (MCC):

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}}$$

and a receiver operating characteristic (ROC) plot (25). A ROC plot shows the true positive rate against the false positive rate for the different possible thresholds of a model test. The area under the ROC curve (AUC) is usually adopted as a scalar measure that gauges one facet of performance (25). In this study, the ROC plot (please refer to http://bioinf.xmu.edu.cn/software/cytosvm/help.htm#roccurve) and AUC were used to compare the performance of different SVM models (Table 1). The enriched-SVM model with virtual positives (model M1) performed best and was therefore chosen to classify the cytokine–receptor interactions.

Table 1.

The descriptions of different SVM models

Model	Enriched training^a	Virtual positives^b	Ratio of positive/negative^c	AUC^d
M1	Yes	Yes	1:2.83	0.9692
M2	Yes	No	1:3.37	0.9204
M3	No	No	1:2.83	0.8856
M4	No	Yes	1:3.31	0.8897
M5	No	Yes	1:1.08	0.8353
M6	No	Yes	1:6.02	0.8834

^a ‘Yes’ means all virtual negative vectors are used for model training in an iterative manner (the enriched training). ‘No’ means only a randomly selected portion of the negative vectors is used for model training.

^b ‘Yes’ means virtual positive vectors are used for model training. ‘No’ means no virtual positive vectors are used.

^c The ratio of positive vectors to negative vectors in the training data sets. The ratio for Models M1–M4 is about 1:3, for M5 about 1:1 and for M6 about 1:6.

^d The area under the receiver operating characteristic (ROC) plot. AUC is often used to measure the performance of models; a higher value indicates better performance.

The enriched-SVM model

Model construction used all 2243 real and virtual positive vectors in the positive data pool and about 1 million negative vectors in the negative data pool. To represent all negative sequential and structural features while reducing the severe imbalance between positive and negative data, the vectors in the negative data pool were randomly divided into 230 groups of about 4200 negative vectors each. Each of these 230 negative groups was combined with the 2243 positive vectors, giving 230 data sets for model construction: 229 were used for independent training runs, and the remaining one was held out for testing. The SVM model was initialized by the 229 independent training runs and then optimized through several rounds of enriched training. The negative support vectors (vectors close to the hyperplane on the negative side) that determine the hyperplanes of the 229 independent models were extracted to form a new negative data pool, which was again arranged into groups for the next round of learning. This iterative learning process, or enriched selection of support vectors, was continued in search of the globally optimal separating hyperplane (OSH) until the positive and negative data came close to balance, at a ratio of about 1:3 in this case. The optimized model was first tested on the held-out data set to assess its theoretical performance, achieving a sensitivity of 100%, a specificity of 99.98% and an MCC of 0.99. Because repeated training on the same data can cause overfitting, the model was further evaluated independently on 79 real cytokine–receptor interactions and 2360 generated negative data from rat, achieving a sensitivity of 97.4%, a specificity of 99.2% and an MCC of 0.89 (Table 2). Such performance is comparable to that of other computational approaches to protein–protein interaction prediction.
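The bookkeeping of the enriched selection loop can be sketched as follows. This is only an outline under stated assumptions: the SVM training step is abstracted as a callable, `select_support_negatives`, and the stand-in used in the example (`nearest_negatives`, which keeps the negatives closest to the positive mean of one-dimensional toy data) is a hypothetical placeholder, not an SVM. The held-out test group described in the text is omitted for brevity.

```python
import random


def split_into_groups(items, n_groups, seed=0):
    """Randomly partition items into n_groups groups of near-equal size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_groups] for i in range(n_groups)]


def enriched_selection(positives, negatives, select_support_negatives,
                       group_size, target_ratio=3.0, max_rounds=20):
    """Shrink the negative pool round by round until it holds at most
    target_ratio negatives per positive.

    select_support_negatives(positives, group) stands in for training one
    model on (positives, group) and returning the negatives of that group
    that became support vectors.
    """
    pool = list(negatives)
    for round_no in range(max_rounds):
        if len(pool) <= target_ratio * len(positives):
            break
        n_groups = max(1, -(-len(pool) // group_size))  # ceiling division
        new_pool = []
        for group in split_into_groups(pool, n_groups, seed=round_no):
            new_pool.extend(select_support_negatives(positives, group))
        if len(new_pool) >= len(pool):
            break  # the pool no longer shrinks
        pool = new_pool
    return pool


def nearest_negatives(positives, group, keep=2):
    """Hypothetical stand-in for SVM training on one-dimensional toy data."""
    centre = sum(positives) / len(positives)
    return sorted(group, key=lambda v: abs(v - centre))[:keep]


toy_positives = [0.9, 1.0, 1.1]
toy_negatives = [i / 100 for i in range(60)]
final_pool = enriched_selection(toy_positives, toy_negatives,
                                nearest_negatives, group_size=10)
```

The point of the design is that negatives far from the decision boundary are discarded early, so later rounds train on an increasingly informative and better balanced negative set.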

Table 2.

The evaluation of CytoSVM model

Data set	TP	FN	SE (%)	TN	FP	SP (%)	MCC
Testing set	2343	0	100	4445	1	99.98	0.99
Independent evaluation set	77	2	97.4	2343	17	99.2	0.89

TP: true positives; FN: false negatives; TN: true negatives; FP: false positives; SE: sensitivity SE = TP/(TP + FN); SP: specificity SP = TN/(TN + FP); MCC: Matthews correlation coefficient.
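The measures defined above can be computed directly from the four counts. The sketch below uses the standard MCC formula, (TP × TN − FP × FN) divided by the square root of (TP + FN)(TP + FP)(TN + FP)(TN + FN), and applies it to the independent evaluation counts reported in Table 2; the function name is an illustrative choice.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    # By convention MCC is taken as 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return sensitivity, specificity, mcc


# Independent evaluation set from Table 2: TP=77, FN=2, TN=2343, FP=17.
se, sp, mcc = classification_metrics(tp=77, fn=2, tn=2343, fp=17)
```

With these counts the MCC rounds to the reported value of 0.89, and the sensitivity and specificity come out just above 97.4% and 99.2%.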

THE ACCESS OF SERVER AND DATABASE

The descriptions of server

The web-based server built on the optimized CytoSVM model can be freely accessed at http://bioinf.xmu.edu.cn/software/cytosvm/PredictReceptor.php (Figure 1). The server runs in a Linux environment and lets the user submit a query through a dynamic interface written in PHP. The default input is the primary protein sequence of a putative receptor/cytokine in standard FASTA format or raw format. The server is case-insensitive; wildcard characters such as ‘*’ and ‘-’ and any non-amino-acid characters are removed from the sequence automatically. An optional function for prediction by protein name is also provided. To start the prediction, the user must also choose a cytokine/receptor or a cytokine/receptor family. The output of the server is a list of cytokines/receptors able to interact with the query sequence, each with its probability. Clicking on the name of a cytokine/receptor leads to a detailed information page, where the user can find links to search for other putative receptors interacting with this cytokine in the human or mouse genome.
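The input handling described above (FASTA or raw input, case-insensitivity, removal of wildcard and non-amino-acid characters) can be mimicked with a small sketch. This is not the server's PHP code; the function name and the decision to keep only the 20 standard one-letter amino acid codes are assumptions made for illustration.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_query(text):
    """Return (header, sequence) from FASTA or raw sequence input.

    The header is None for raw input. Letters are upper-cased and every
    character outside the 20 standard amino acid codes is dropped.
    """
    lines = text.strip().splitlines()
    header = None
    if lines and lines[0].startswith(">"):
        header = lines[0][1:].strip()
        lines = lines[1:]
    residues = "".join(lines).upper()
    sequence = "".join(ch for ch in residues if ch in STANDARD_AMINO_ACIDS)
    return header, sequence


# Hypothetical query containing lower case, a gap, a stop symbol and digits.
header, sequence = clean_query(">query_1 example\nmk-t*ay\nXB1 ll")
```

Normalizing the query before feature extraction keeps residue counts, and therefore the composition-based features, independent of how the sequence was pasted.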

Figure 1.

The interface of CytoSVM server.

The descriptions of database

In this study, the well-optimized CytoSVM model was also applied to screen for putative cytokine–receptor interactions in the whole genomes of human and mouse. After further transmembrane analysis, 1609 novel cytokine-receptor pairs with probability >80% (3346 pairs with probability >50%), covering 220 novel receptors (excluding their isoforms) for 126 human cytokines, were identified in the human genome (http://bioinf.xmu.edu.cn/software/cytosvm/statistics.php). These predicted results were deposited in a database at http://bioinf.xmu.edu.cn/software/cytosvm/BrowseSearch.php. The database runs on a Linux/Apache/PHP platform and is managed with an Oracle 9i RDBMS, which allows multiple simultaneous accesses. The user can search for the putative receptors of a specific cytokine by selecting it from the cytokine classification list (Figure 2). Quick search by keyword is also supported for finding putative interactions of cytokines or receptors. Each search returns only interactions with a probability value >50%. Clicking on the name of a cytokine or receptor leads to a detailed information page that shows the general properties of the interacting partners. Statistics on putative cytokine-receptor pairs in the human genome and help documents are also provided to assist with database and server access.

Figure 2.

The interface of CytoSVM database.

CONCLUSION

In conclusion, a web-based enriched-SVM model, CytoSVM, was successfully constructed in this study to predict putative cytokine–receptor interactions. As a complement to homology-based methods and other computational approaches to protein–protein interaction prediction, CytoSVM shows its capability in functionally annotating proteins that have poor sequence similarity to known proteins. The application of CytoSVM to the discovery of novel cytokine–receptor interactions at genome scale broadens the understanding of the physiological activities of cytokines at the systems level. These predicted interactants make it possible to identify novel cytokine-involved cellular processes and may prompt the identification of new therapeutic targets for the treatment of various diseases. It is thus expected that the clues provided by this study will guide experimental verification in the future.

20 in total

1. Protein interaction maps for complete genomes based on gene fusion events.

Authors: A J Enright; I Iliopoulos; N C Kyrpides; C A Ouzounis
Journal: Nature Date: 1999-11-04 Impact factor: 49.962

2. Multi-class protein fold recognition using support vector machines and neural networks.

Authors: C H Ding; I Dubchak
Journal: Bioinformatics Date: 2001-04 Impact factor: 6.937

3. A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach.

Authors: S Hua; Z Sun
Journal: J Mol Biol Date: 2001-04-27 Impact factor: 5.469

Review 4. Assessing the accuracy of prediction algorithms for classification: an overview.

5. Enhanced functional annotation of protein sequences via the use of structural descriptors.

Review 6. Cytokines: past, present, and future.

7. Classifying G-protein coupled receptors with support vector machines.

Review 8. Determination of protein function, evolution and interactions by structural genomics.

9. Predicting protein--protein interactions from primary structure.

10. Assessing protein co-evolution in the context of the tree of life assists in the prediction of the interactome.
