Chun Yan Yu, Xiao Xu Li, Hong Yang, Ying Hong Li, Wei Wei Xue, Yu Zong Chen, Lin Tao, Feng Zhu.
Abstract
The function of a protein is of great interest in cutting-edge research on biological mechanisms, disease development and drug/target discovery. Besides experimental exploration, a variety of computational methods have been designed to predict protein function. Among these in silico methods, BLAST predicts function from protein sequence similarity, whereas machine learning also works from the sequence but does not rely on sequence similarity. This characteristic makes machine learning a good complement to BLAST and many other approaches when predicting the function of remotely related proteins and of homologous proteins with distinct functions. However, the identification accuracies and false discovery rates of these in silico methods have not yet been assessed, which greatly limits the use of these algorithms. Herein, a comprehensive comparison of the performance of four popular prediction algorithms (BLAST, SVM, PNN and KNN) was conducted. In particular, the performance of these methods was systematically assessed by four standard statistical indexes, based on independent test datasets of 93 functional protein families defined by UniProtKB keywords. Moreover, the false discovery rates of these algorithms were evaluated by scanning the genomes of four representative model organisms (Homo sapiens, Arabidopsis thaliana, Saccharomyces cerevisiae and Mycobacterium tuberculosis). SVM and BLAST showed substantially higher sensitivity than PNN and KNN. However, the machine learning algorithms (PNN, KNN and SVM) were found capable of substantially reducing the false discovery rate (SVM < PNN < KNN). In sum, this study comprehensively assessed the performance of four popular algorithms applied to protein function prediction, which could facilitate the selection of the most appropriate method in related biomedical research.
Keywords: BLAST; false discovery rate; machine learning; protein function prediction; support vector machine
Year: 2018 PMID: 29316706 PMCID: PMC5796132 DOI: 10.3390/ijms19010183
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. Statistical differences in the performance of four protein function prediction algorithms (BLAST, SVM, PNN and KNN) assessed by four metrics: (A) sensitivity (SE); (B) specificity (SP); (C) accuracy (ACC); and (D) Matthews correlation coefficient (MCC). Significant and moderately significant differences were shown by a p-value of (**), respectively.
The performance of four protein function prediction algorithms assessed by four popular metrics: sensitivity (SE), specificity (SP), accuracy (ACC) and Matthews correlation coefficient (MCC).
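| UniProt Keyword | Protein Functional Family | GO Category | BLAST SE (%) | BLAST SP (%) | BLAST ACC (%) | BLAST MCC | SVM SE (%) | SVM SP (%) | SVM ACC (%) | SVM MCC | PNN SE (%) | PNN SP (%) | PNN ACC (%) | PNN MCC | KNN SE (%) | KNN SP (%) | KNN ACC (%) | KNN MCC |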
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KW-0020 | Allergen | - | 76.32 | 98.92 | 98.78 | 0.48 | 84.81 | 99.69 | 99.66 | 0.57 | 86.42 | 99.84 | 99.81 | 0.69 | 74.07 | 99.48 | 99.32 | 0.41 |
| KW-0049 | Antioxidant | GO:0016209 | 94.15 | 99.23 | 99.20 | 0.60 | 89.00 | 99.76 | 99.73 | 0.67 | 86.00 | 99.84 | 99.80 | 0.71 | 69.00 | 99.42 | 99.24 | 0.43 |
| KW-0117 | Actin capping | GO:0051693 | 94.55 | 99.08 | 99.07 | 0.35 | 93.98 | 99.75 | 99.74 | 0.70 | 91.18 | 99.80 | 99.78 | 0.71 | 73.53 | 99.42 | 99.22 | 0.43 |
| KW-0147 | Chitin-binding | GO:0008061 | 86.96 | 98.96 | 98.94 | 0.34 | 91.75 | 99.72 | 99.68 | 0.78 | 75.36 | 99.61 | 99.47 | 0.63 | 93.84 | 98.57 | 98.05 | 0.37 |
| KW-0157 | Chromophore | GO:0018298 | 96.70 | 98.54 | 98.51 | 0.70 | 93.83 | 99.74 | 99.68 | 0.86 | 86.91 | 99.66 | 99.52 | 0.80 | 89.38 | 99.48 | 98.53 | 0.59 |
| KW-0195 | Cyclin | GO:0061575 | 89.34 | 98.92 | 98.89 | 0.44 | 97.96 | 99.78 | 99.78 | 0.60 | 89.80 | 99.84 | 99.83 | 0.62 | 75.51 | 99.63 | 99.53 | 0.39 |
| KW-0251 | Elongation factor | GO:0003746 | 99.51 | 98.57 | 98.60 | 0.83 | 96.72 | 99.67 | 99.62 | 0.92 | 84.14 | 99.67 | 99.29 | 0.85 | 95.84 | 99.46 | 97.21 | 0.63 |
| KW-0339 | Growth factor | GO:0008083 | 94.05 | 98.99 | 98.95 | 0.65 | 84.30 | 99.69 | 99.62 | 0.76 | 86.01 | 99.81 | 99.72 | 0.80 | 76.54 | 99.66 | 99.16 | 0.61 |
| KW-0343 | GTPase activation | GO:0005096 | 76.06 | 98.57 | 98.40 | 0.47 | 92.45 | 99.67 | 99.65 | 0.66 | 86.73 | 99.82 | 99.78 | 0.72 | 61.95 | 99.44 | 99.25 | 0.46 |
| KW-0344 | Guanine-nucleotide releasing factor | GO:0005085 | 74.09 | 98.57 | 98.44 | 0.39 | 83.33 | 99.72 | 99.69 | 0.57 | 89.74 | 99.64 | 99.62 | 0.56 | 93.59 | 99.15 | 98.95 | 0.31 |
| KW-0396 | Initiation factor | GO:0003743 | 96.88 | 98.92 | 98.86 | 0.83 | 91.36 | 99.66 | 99.50 | 0.87 | 74.21 | 99.82 | 99.32 | 0.81 | 77.64 | 99.45 | 97.98 | 0.65 |
| KW-0497 | Mitogen | GO:0051781 | 89.25 | 98.98 | 98.96 | 0.40 | 92.74 | 99.73 | 99.66 | 0.85 | 83.60 | 99.61 | 99.45 | 0.75 | 85.22 | 99.62 | 98.78 | 0.62 |
| KW-0505 | Motor protein | GO:0098840 | 93.38 | 98.96 | 98.91 | 0.63 | 89.47 | 99.75 | 99.72 | 0.69 | 80.70 | 99.86 | 99.80 | 0.72 | 64.04 | 99.45 | 99.25 | 0.46 |
| KW-0514 | Muscle protein | - | 94.22 | 98.95 | 98.92 | 0.57 | 95.38 | 99.75 | 99.73 | 0.74 | 89.23 | 99.69 | 99.65 | 0.67 | 80.00 | 99.60 | 99.32 | 0.51 |
| KW-0515 | Mutator protein | GO:1990633 | 97.65 | 98.97 | 98.97 | 0.42 | 83.82 | 99.79 | 99.76 | 0.60 | 77.94 | 99.84 | 99.80 | 0.61 | 70.59 | 99.45 | 99.32 | 0.38 |
| KW-0568 | Pathogenesis-related protein | GO:0009607 | 92.86 | 98.98 | 98.97 | 0.29 | 93.36 | 99.78 | 99.74 | 0.89 | 94.87 | 99.63 | 99.58 | 0.84 | 91.20 | 99.71 | 98.72 | 0.64 |
| KW-0734 | Signal transduction inhibitor | GO:0009968 | 81.25 | 98.97 | 98.94 | 0.31 | 84.62 | 99.71 | 99.69 | 0.45 | 84.62 | 99.68 | 99.66 | 0.43 | 87.18 | 99.63 | 99.54 | 0.34 |
| KW-0786 | Thiamine pyrophosphate binding | - | 97.08 | 98.95 | 98.93 | 0.71 | 96.04 | 99.73 | 99.70 | 0.85 | 87.70 | 99.89 | 99.79 | 0.87 | 74.76 | 99.43 | 98.80 | 0.58 |
| KW-0830 | Ubiquinone binding | - | 98.37 | 98.50 | 98.49 | 0.87 | 94.07 | 99.72 | 99.56 | 0.92 | 82.58 | 99.46 | 98.98 | 0.82 | 91.47 | 99.73 | 97.20 | 0.68 |
| KW-0847 | Vitamin C binding | GO:0031418 | 94.21 | 98.96 | 98.94 | 0.46 | 91.89 | 99.79 | 99.78 | 0.53 | 97.30 | 99.69 | 99.69 | 0.48 | 81.08 | 99.64 | 99.56 | 0.35 |
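The four metrics in the table above are conventionally computed from the counts of true positives (TP), false negatives (FN), true negatives (TN) and false positives (FP). The following is a minimal Python sketch using the standard textbook definitions. The function name `classification_metrics` and the example counts are hypothetical and are not taken from the study.

```python
import math

def classification_metrics(tp, fn, tn, fp):
    """Return SE, SP, ACC (as percentages) and MCC from confusion-matrix counts.

    Standard definitions; assumed here, not quoted from the study.
    """
    se = 100.0 * tp / (tp + fn)                    # sensitivity: recovered family members
    sp = 100.0 * tn / (tn + fp)                    # specificity: rejected non-members
    acc = 100.0 * (tp + tn) / (tp + fn + tn + fp)  # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return se, sp, acc, mcc

# Hypothetical, heavily imbalanced test set: 100 members vs 10,000 non-members.
se, sp, acc, mcc = classification_metrics(tp=90, fn=10, tn=9900, fp=100)
print(f"SE={se:.2f}%  SP={sp:.2f}%  ACC={acc:.2f}%  MCC={mcc:.2f}")
```

With far more non-members than members, SP and ACC can both sit near 99% while MCC stays well below 1. The same pattern is visible in many rows of the table.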
The false discovery rate assessed by the percentage of proteins identified from the Homo sapiens and Arabidopsis thaliana genomes by different algorithms.
| UniProt Keyword | Protein Functional Family | H. sapiens UniProt (%) | H. sapiens SVM (%) | H. sapiens BLAST (%) | H. sapiens PNN (%) | H. sapiens KNN (%) | A. thaliana UniProt (%) | A. thaliana SVM (%) | A. thaliana BLAST (%) | A. thaliana PNN (%) | A. thaliana KNN (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| KW-0117 | Actin capping | 0.09 | 0.12 | 0.72 | 0.10 | 0.10 | 0.05 | 0.07 | 0.11 | 0.05 | 0.05 |
| KW-0020 | Allergen | 0.02 | 0.18 | 3.68 | 0.11 | 0.04 | 0.01 | 0.17 | 6.22 | 0.07 | 0.09 |
| KW-0049 | Antioxidant | 0.07 | 0.09 | 0.50 | 0.08 | 0.07 | 0.09 | 0.16 | 1.11 | 0.12 | 0.13 |
| KW-0147 | Chitin-binding | 0.02 | 0.16 | 0.36 | 0.02 | 0.10 | 0.08 | 0.24 | 3.57 | 0.08 | 0.18 |
| KW-0157 | Chromophore | 0.07 | 0.15 | 2.10 | 0.07 | 0.10 | 0.28 | 0.38 | 0.88 | 0.23 | 0.30 |
| KW-0195 | Cyclin | 0.16 | 0.24 | 0.40 | 0.18 | 0.19 | 0.33 | 0.36 | 0.61 | 0.34 | 0.34 |
| KW-0251 | Elongation factor | 0.08 | 0.11 | 0.45 | 0.08 | 0.09 | 0.15 | 0.19 | 0.48 | 0.14 | 0.16 |
| KW-0339 | Growth factor | 0.65 | 0.93 | 2.50 | 0.71 | 0.73 | 0.12 | 0.18 | 0.24 | 0.13 | 0.14 |
| KW-0343 | GTPase activation | 0.97 | 1.19 | 5.47 | 0.93 | 1.02 | 0.28 | 0.24 | 1.36 | 0.21 | 0.23 |
| KW-0344 | Guanine-nucleotide releasing factor | 0.73 | 0.86 | 5.37 | 0.73 | 0.75 | 0.18 | 0.20 | 2.12 | 0.17 | 0.19 |
| KW-0396 | Initiation factor | 0.24 | 0.39 | 1.70 | 0.26 | 0.25 | 0.26 | 0.38 | 1.71 | 0.24 | 0.28 |
| KW-0497 | Mitogen | 0.20 | 0.65 | 4.37 | 0.30 | 0.35 | 0.00 | 0.07 | 0.52 | 0.01 | 0.02 |
| KW-0505 | Motor protein | 0.66 | 0.75 | 4.07 | 0.67 | 0.67 | 0.59 | 0.45 | 2.14 | 0.34 | 0.42 |
| KW-0514 | Muscle protein | 0.31 | 0.42 | 4.35 | 0.37 | 0.39 | 0.00 | 0.17 | 1.26 | 0.11 | 0.13 |
| KW-0515 | Mutator protein | 0.01 | 0.02 | 0.05 | 0.01 | 0.01 | 0.01 | 0.01 | 0.05 | 0.01 | 0.01 |
| KW-0568 | Pathogenesis-related protein | 0.00 | 0.08 | 0.09 | 0.04 | 0.05 | 0.13 | 0.20 | 0.91 | 0.15 | 0.16 |
| KW-0734 | Signal transduction inhibitor | 0.22 | 0.23 | 1.22 | 0.21 | 0.21 | 0.01 | 0.01 | 0.74 | 0.01 | 0.01 |
| KW-0786 | Thiamine pyrophosphate binding | 0.06 | 0.07 | 0.13 | 0.06 | 0.06 | 0.12 | 0.15 | 0.28 | 0.13 | 0.14 |
| KW-0830 | Ubiquinone binding | 0.08 | 0.71 | 0.12 | 0.19 | 0.60 | 0.13 | 0.25 | 0.42 | 0.17 | 0.18 |
| KW-0847 | Vitamin C binding | 0.10 | 0.12 | 0.18 | 0.10 | 0.09 | 0.07 | 0.11 | 0.53 | 0.07 | 0.08 |
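One way to read the table above is to compare, for each family, the percentage of the genome that each algorithm assigns to the family with the percentage annotated in UniProt. The sketch below does this for the Homo sapiens Allergen row (KW-0020). Treating the excess over the UniProt percentage as a proxy for false discoveries is an assumption made for illustration, and the helper name `excess_over_annotation` is hypothetical. The study's exact procedure is not given here.

```python
# Percentages of human proteins assigned to the Allergen family (KW-0020),
# copied from the Homo sapiens columns of the table above.
identified_pct = {"UniProt": 0.02, "SVM": 0.18, "BLAST": 3.68, "PNN": 0.11, "KNN": 0.04}

def excess_over_annotation(pct, reference="UniProt"):
    """Percentage-point excess of each method over the reference annotation.

    Assumption: a larger excess suggests more false discoveries.
    """
    base = pct[reference]
    return {method: round(value - base, 2)
            for method, value in pct.items() if method != reference}

excess = excess_over_annotation(identified_pct)
ranked = sorted(excess, key=excess.get)  # smallest excess first

for method in ranked:
    print(f"{method}: +{excess[method]:.2f} percentage points over UniProt")
```

For this row, BLAST shows by far the largest excess over the UniProt annotation.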
Figure 2. The false discovery rates reflected by the percentage of proteins identified from the genomes of (a) Homo sapiens, (b) Arabidopsis thaliana, (c) Saccharomyces cerevisiae and (d) Mycobacterium tuberculosis.
Figure 3. The false discovery rates reflected by the percentage of proteins identified from the Homo sapiens genome for 15 protein families that exist only in plants, microbes or viruses and not in humans.