Masayuki Yarimizu, Cao Wei, Yusuke Komiyama, Kokoro Ueki, Shugo Nakamura, Kazuya Sumikoshi, Tohru Terada, Kentaro Shimizu.
Abstract
Receptor tyrosine kinases are essential proteins involved in cellular differentiation and proliferation in vivo and are heavily involved in allergic diseases, diabetes, and the onset and proliferation of cancerous cells. Identifying the interacting partners of these proteins, the growth factor ligands, will provide a deeper understanding of cellular proliferation, differentiation, and other cell processes. In this study, we developed a method for predicting tyrosine kinase ligand-receptor pairs from their amino acid sequences. We collected tyrosine kinase ligand-receptor pairs from the Database of Interacting Proteins (DIP) and UniProtKB, filtered them by removing sequence redundancy, and used them as a dataset for machine learning and for assessing predictive performance. Our prediction method is based on support vector machines (SVMs). We evaluated several input features suited to tyrosine kinases and compared and analyzed the results. Using sequence pattern information and domain information extracted from the sequences as input features, we obtained an area under the receiver operating characteristic curve (AUC) of 0.996. This accuracy is higher than that obtained from general protein-protein interaction pair predictions.
Year: 2015 PMID: 26347773 PMCID: PMC4548105 DOI: 10.1155/2015/528097
Source DB: PubMed Journal: Adv Bioinformatics ISSN: 1687-8027
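As an illustration of the sequence pattern features mentioned in the abstract (the results table lists 1-mer and 2-mer frequency methods), the following is a minimal sketch of a k-mer frequency encoding for a receptor-ligand pair. The normalization by total k-mer count, the concatenation of the two vectors, the function names, and the example sequences are all assumptions made for illustration; the record does not specify how the authors built their vectors.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_frequencies(sequence, k=2):
    """Return relative frequencies of all 20**k k-mers in a sequence.

    Assumption: k-mers containing non-standard residues are skipped and
    counts are normalized by the number of k-mers actually counted.
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def pair_features(receptor_seq, ligand_seq, k=2):
    """Concatenate receptor and ligand k-mer vectors (hypothetical layout)."""
    return kmer_frequencies(receptor_seq, k) + kmer_frequencies(ligand_seq, k)


# Made-up toy sequences, not taken from the dataset.
vec = pair_features("MKTAYIAKQR", "MGSSHHHHHH")
print(len(vec))  # 800 (400 two-mers per protein)
```

A vector of this kind could then be passed to any SVM implementation as the input for one candidate pair.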
Figure 1. Flow of dataset generation. Data from the Database of Interacting Proteins (DIP) were filtered by selecting only pairs containing a protein found in a UniProtKB [10] keyword and EC number search (“tyrosine kinase” AND EC:2.7.10.1 AND reviewed: yes). To remove redundancy, the 174 hits obtained from the interacting pairs of receptor tyrosine kinases and ligands were clustered using BLASTclust, and one protein was extracted from each cluster. The BLASTclust criterion was an identity of 80% or above over 100% of the amino acid sequence. Clustering yielded 34 receptor tyrosine kinases and 67 ligand proteins. On the basis of these procedures, 95 pairs were obtained as the final positive data for protein-protein interaction. Negative data (2183 pairs) were prepared artificially by excluding the 95 positive pairs from all combinations of the retrieved receptor tyrosine kinases and ligands.
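The negative-set construction described in the caption can be sketched as follows: all 34 × 67 = 2278 receptor-ligand combinations are enumerated, and the 95 known positive pairs are removed, leaving 2183 negatives. The identifiers and the choice of which pairs are positive are placeholders, not the actual dataset.

```python
from itertools import product


def build_negative_pairs(receptors, ligands, positive_pairs):
    """Return every receptor-ligand combination that is not a known positive."""
    positives = set(positive_pairs)
    return [(r, l) for r, l in product(receptors, ligands)
            if (r, l) not in positives]


# Placeholder identifiers standing in for the clustered proteins.
receptors = [f"R{i}" for i in range(34)]
ligands = [f"L{j}" for j in range(67)]

all_pairs = list(product(receptors, ligands))
# Placeholder positives: 95 arbitrary distinct pairs (the real ones come from DIP).
positives = all_pairs[:95]

negatives = build_negative_pairs(receptors, ligands, positives)
print(len(all_pairs), len(negatives))  # 2278 2183
```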
Figure 2. Domain and superfamily encoding. A receptor contains domains r1, r3, and r4. Domains r1 and r3 belong to superfamily R2, and domain r4 belongs to superfamily R3. A ligand contains domains l1, l2, and r4, which belong to superfamilies L1, L2, and L4, respectively.
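One way to read this caption is as a presence/absence encoding at two levels of granularity. The sketch below assumes binary vectors over fixed vocabularies and a simple concatenation of the receptor and ligand parts; the vocabulary sizes (four receptor domains, three receptor superfamilies, four ligand superfamilies) and all names are hypothetical, since the caption only lists the members present in the example.

```python
def presence_vector(items, vocabulary):
    """Binary vector: 1 where a vocabulary entry occurs in items, else 0."""
    present = set(items)
    return [1 if v in present else 0 for v in vocabulary]


# Receptor example from the caption: domain -> superfamily.
receptor_domain_to_superfamily = {"r1": "R2", "r3": "R2", "r4": "R3"}
# Ligand superfamilies listed in the caption.
ligand_superfamilies = ["L1", "L2", "L4"]

# Hypothetical vocabularies (their sizes are assumptions).
RECEPTOR_DOMAINS = ["r1", "r2", "r3", "r4"]
RECEPTOR_SUPERFAMILIES = ["R1", "R2", "R3"]
LIGAND_SUPERFAMILIES = ["L1", "L2", "L3", "L4"]

# Domain-level encoding of the receptor.
domain_vec = presence_vector(receptor_domain_to_superfamily, RECEPTOR_DOMAINS)
# Superfamily-level encoding: r1 and r3 collapse onto the same entry R2.
superfamily_vec = presence_vector(
    receptor_domain_to_superfamily.values(), RECEPTOR_SUPERFAMILIES
)
# Pair vector at the superfamily level: receptor part followed by ligand part.
pair_vec = superfamily_vec + presence_vector(
    ligand_superfamilies, LIGAND_SUPERFAMILIES
)

print(domain_vec)       # [1, 0, 1, 1]
print(superfamily_vec)  # [0, 1, 1]
print(pair_vec)         # [0, 1, 1, 1, 1, 0, 1]
```

Collapsing domains into superfamilies shortens the vector and lets related domains share a feature, which is one plausible reason the superfamily-level row scores higher than the domain-level row in the results table.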
Prediction results of each method.
| Method | Precision | Recall | F-measure | AUC | TP | FP | TN | FN |
|---|---|---|---|---|---|---|---|---|
| 1-mer frequency | — | — | — | 0.638 | 0 | 0 | 2183 | 95 |
| 2-mer frequency | 0.178 | 0.442 | 0.253 | 0.713 | 42 | 191 | 1992 | 53 |
| Domain level | 0.387 | 0.337 | 0.357 | 0.801 | 32 | 48 | 2135 | 63 |
| Superfamily level | 1.0 | 0.674 | 0.802 | 0.974 | 64 | 0 | 2183 | 31 |
| Composite vector (2-mer frequency + superfamily level) | 0.984 | 0.694 | 0.812 | 0.996 | 66 | 1 | 2182 | 29 |
| Combined results (2-mer frequency + superfamily level) | 0.868 | 0.611 | 0.712 | 0.906 | 58 | 10 | 2173 | 37 |
Each row in the table describes a single method. “Composite vector” denotes prediction based on the combined feature vectors of the 2-mer frequency and domain (superfamily) methods, and “combined results” denotes the combination of the predicted results of the 2-mer frequency and domain (superfamily) methods. TP, FP, TN, and FN: true positive, false positive, true negative, and false negative, respectively.
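For reference, precision, recall, and their harmonic mean (F-measure) can be derived from the confusion counts as sketched below. The function name is illustrative. Values recomputed from the pooled counts may differ slightly from the tabulated ones; the record does not state how the tabulated figures were aggregated.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure); None where a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    if not precision or not recall:
        # F-measure is undefined when precision is undefined or both are zero.
        return precision, recall, None
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Superfamily-level row: TP = 64, FP = 0, FN = 31.
p, r, f = precision_recall_f(64, 0, 31)
print(round(p, 3), round(r, 3), round(f, 3))  # 1.0 0.674 0.805

# 1-mer row: no pair is predicted positive, so precision is undefined.
print(precision_recall_f(0, 0, 95))  # (None, 0.0, None)
```

The second call shows why the 1-mer frequency row has no precision entry: with TP = FP = 0 the denominator is zero.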
Figure 3. ROC curve of each method. (a) Prediction based on single features. (b) Prediction based on combining methods.
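The area under a ROC curve equals the probability that a randomly chosen positive pair receives a higher classifier score than a randomly chosen negative pair, with ties counted as one half. The sketch below computes AUC this way from made-up scores; it illustrates the metric only and is not necessarily how the authors computed their values.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: 2 positives, 3 negatives, hypothetical decision scores.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 3))  # 0.833 (5 of 6 positive-negative pairs ordered correctly)
```

This pairwise form is quadratic in the number of examples, which is acceptable for a dataset of 95 positives and 2183 negatives.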