| Literature DB >> 28881966 |
Ishita K. Khan¹, Mansurul Bhuiyan², Daisuke Kihara¹,³.
Abstract
MOTIVATION: Moonlighting proteins (MPs) are an important class of proteins that perform more than one independent cellular function. MPs have been gaining attention in recent years as they are found to play important roles in various systems, including disease development. MPs also have a significant impact on computational function prediction and database annotation. Currently, MPs are not labeled as such in biological databases, even in cases where multiple distinct functions are known for the proteins. In this work, we propose a novel method named DextMP, which predicts whether or not a protein is an MP based on textual features extracted from scientific literature and the UniProt database.
Year: 2017 PMID: 28881966 PMCID: PMC5870774 DOI: 10.1093/bioinformatics/btx231
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Data size for the MP/non-MP dataset
| | #Proteins | #Titles | #Abstracts | #Functions |
|---|---|---|---|---|
| MP | 263 | 2496 (214) | 1450 (158) | 194 |
| non-MP | 162 | 1665 (162) | 1624 (162) | 162 |
The numbers in parentheses show how many proteins had that type of text data. For example, of the 263 MPs, at least one publication title was found for 214.
Fig. 1. Distribution of the number of abstracts per protein. Black, MP; gray, non-MP in the control dataset. The first bar is for 1 and 2 abstracts, the next bar for 3 and 4, and so on.
Fig. 2. Schematic diagram of DextMP. The upper panel shows the text prediction process, while the bottom panel shows the prediction model that uses predicted text labels to make the final MP/non-MP classification. P1: Protein 1; CL: Class Label.
Fig. 3. Word clouds of text information of the moonlighting protein dataset. The size of a word in the visualization is proportional to the number of times the word appears in the input text. (A–C): titles, function descriptions and abstracts, respectively. The images were generated at http://www.wordle.net/
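Of the three language models compared in the tables below, TFIDF is the simplest: each word is weighted by its frequency in a document, discounted by how many documents contain it. A minimal, illustrative sketch in plain Python (tokenization and the exact weighting scheme are simplifications; the LDA and DEEP models are not shown):

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF vectors for a list of tokenized documents.
    tf = term count / document length; idf = log(N / document frequency)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term at most once per document
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({term: (count / len(doc)) * math.log(n / df[term])
                        for term, count in tf.items()})
    return vectors

# Toy stand-ins for publication titles (not from the actual dataset)
docs = [
    "enzyme moonlights as transcription factor".split(),
    "enzyme catalyzes metabolic reaction".split(),
]
vecs = tfidf(docs)
# "enzyme" occurs in every document, so log(2/2) = 0 zeroes out its weight
```

Terms shared by all documents get zero weight, so the resulting vectors emphasize words that distinguish one text from the rest.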
Summary of the text-level prediction with different combinations of text types, language models and classifiers
| Text Type | Language Model | LR | RF | SVM | GBM |
|---|---|---|---|---|---|
| Titles | TFIDF | 0.7218 | | | |
| | LDA | 0.6128 | 0.6829 | 0.6584 | 0.7065 |
| | DEEP | 0.7696 | 0.7402 | 0.8429 | |
| | PDEEP | 0.6262 | 0.5482 | 0.4836 | 0.6445 |
| Abstracts | TFIDF | | | | |
| | LDA | 0.6419 | 0.6936 | 0.6512 | 0.7349 |
| | DEEP | 0.7775 | 0.8119 | 0.8480 | 0.7987 |
| | PDEEP | – | – | – | – |
| Function Descriptions | TFIDF | 0.7412 | 0.7439 | 0.7715 | 0.6947 |
| | LDA | 0.6128 | 0.6829 | 0.6582 | 0.7065 |
| | DEEP | | | | |
| | PDEEP | 0.7017 | 0.7211 | 0.3474 | 0.6917 |
The two-class weighted F-score is reported: the F-scores of the MP and non-MP classes were averaged, weighted by the number of data points in each class. The values shown are averages over the test sets in the five-fold cross-validation. LR, Logistic Regression; RF, Random Forest; SVM, Support Vector Machine; GBM, Gradient Boosted Machine.
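The class-weighted average described in the caption can be written out directly. In the helper below, the class sizes are taken from the MP/non-MP dataset table, but the per-class F-scores are invented for illustration:

```python
def weighted_f_score(f_scores, counts):
    """Average per-class F-scores, weighted by each class's number of data points."""
    return sum(f * n for f, n in zip(f_scores, counts)) / sum(counts)

# 263 MPs and 162 non-MPs (from the dataset table); the F-scores are made up
print(weighted_f_score([0.80, 0.70], [263, 162]))
```

Because the MP class is larger, its F-score pulls the weighted average toward itself more strongly than the non-MP score does.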
Computational time (seconds)
| Phase | Text Type | TFIDF | LDA | DEEP |
|---|---|---|---|---|
| Training | Titles | 5.8 × 10⁻⁵ | 1.0 × 10⁻³ | 4.4 × 10⁻¹ |
| | Abstracts | 3.3 × 10⁻⁴ | 2.9 × 10⁻³ | 9.0 × 10⁻¹ |
| | Function Dsc. | 6.3 × 10⁻⁴ | 1.5 × 10⁻² | 1.2 |
| Feature generation | Titles | 7.8 × 10⁻⁴ | 3.3 × 10⁻⁴ | 1.8 × 10⁻⁴ |
| | Abstracts | 3.2 × 10⁻³ | 6.6 × 10⁻⁴ | 2.4 × 10⁻⁴ |
| | Function Dsc. | 2.2 × 10⁻³ | 1.0 × 10⁻³ | 2.0 × 10⁻⁴ |
| Classification | Titles | 5.1 × 10⁻² | 3.8 × 10⁻³ | 9.2 × 10⁻³ |
| | Abstracts | 1.2 × 10⁻¹ | 4.0 × 10⁻³ | 1.3 × 10⁻² |
| | Function Dsc. | 7.0 × 10⁻² | 5.3 × 10⁻³ | 6.1 × 10⁻³ |
Fig. 4. Protein-level cross-validation F-scores for weighted and non-weighted majority votes. Results for 21 (text type)-(language model)-(classifier) combinations are compared.
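The protein-level aggregation compared in Figure 4 can be sketched as a majority vote over the per-text predictions, with an optional per-text weight. The weighting below is a generic stand-in, not the paper's exact scheme:

```python
from collections import defaultdict

def majority_vote(labels, weights=None):
    """Collapse per-text MP/non-MP labels into one protein-level call.
    With weights=None every text counts equally (plain majority vote)."""
    if weights is None:
        weights = [1.0] * len(labels)
    tally = defaultdict(float)
    for label, w in zip(labels, weights):
        tally[label] += w
    return max(tally, key=tally.get)

print(majority_vote(["MP", "non-MP", "MP"]))                # plain vote
print(majority_vote(["MP", "non-MP"], weights=[0.4, 0.9]))  # weighted vote
```

With weights attached, a single confidently classified text can outvote several uncertain ones, which is what distinguishes the weighted from the non-weighted curves in the figure.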
Summary of the protein-level prediction
| Text Type | Language Model | LR | RF | SVM | GBM |
|---|---|---|---|---|---|
| Titles | TFIDF | | | | |
| | LDA | 0.5654 | 0.5723 | 0.5836 | 0.6227 |
| | DEEP | 0.6651 | 0.6698 | 0.7557 | 0.6826 |
| | PDEEP | 0.6611 | 0.5278 | 0.4314 | 0.6021 |
| Abstracts | TFIDF | 0.7833 | | | |
| | LDA | 0.5459 | 0.5739 | 0.5342 | 0.5713 |
| | DEEP | 0.7650 | 0.8105 | 0.7747 | |
| | PDEEP | – | – | – | – |
| Function Descriptions | TFIDF | 0.7412 | 0.7439 | 0.7715 | 0.6947 |
| | LDA | 0.6128 | 0.6829 | 0.6582 | 0.7065 |
| | DEEP | | | | |
| | PDEEP | 0.7017 | 0.7211 | 0.3474 | 0.6917 |
F-scores are reported as averages over the test sets in the five-fold cross-validation. LR, Logistic Regression; RF, Random Forest; SVM, Support Vector Machine; GBM, Gradient Boosted Machine. For each text type (titles, abstracts and function descriptions), the best-performing language model under the four classifiers is highlighted in bold.
Genome-scale prediction by DextMP
| | Yeast | Human | X. laevis |
|---|---|---|---|
| # Proteins | 6721 | 20 104 | 11 078 |
| Coverage | 96.73% | 98.06% | 30.54% |
| # MPs (%) (vote ≥ 3) | 2316 (34.46%) | 4781 (23.78%) | 600 (5.42%) |
| # MPs (%) (vote = 4) | 896 (13.33%) | 1682 (8.37%) | 279 (2.51%) |
| # known MPs | 23 | 45 | – |
| Recall (vote ≥ 3) | 0.889 | 0.933 | – |
| Recall (vote = 4) | 0.741 | 0.689 | – |
Coverage: the percentage of proteins in a genome that have both a literature title and a function description, so that DextMP can be run on them. Two prediction results are shown: the number of predicted MPs detected by three or more settings (vote ≥ 3) and the number detected by all four settings unanimously (vote = 4). X. laevis has no known MPs. The fractions in parentheses were computed for predicted MPs among all proteins in the genome.
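The recall rows in the table are the fraction of known MPs recovered at each vote threshold. A minimal sketch, using hypothetical protein identifiers rather than real UniProt accessions:

```python
def recall(known_mps, predicted_mps):
    """Fraction of known MPs that appear in the predicted set."""
    known = set(known_mps)
    return len(known & set(predicted_mps)) / len(known)

# Hypothetical identifiers, not real accessions
known = ["P1", "P2", "P3", "P4"]
predicted = ["P1", "P2", "P3", "P9"]
print(recall(known, predicted))
```

Raising the vote threshold from ≥ 3 to = 4 shrinks the predicted set, which is why recall drops in the table while the predictions become more conservative.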