Seokjun Seo1, Minsik Oh1, Youngjune Park2, Sun Kim1,2,3.
Abstract
Motivation: A large number of newly sequenced proteins are generated by next-generation sequencing technologies, and assigning biochemical functions to these proteins is an important task. However, biological experiments are too expensive to characterize such a large number of protein sequences, so protein function prediction is primarily done by computational modeling methods, such as the profile Hidden Markov Model (pHMM) and k-mer based methods. Nevertheless, existing methods have limitations: k-mer based methods are not accurate enough to assign protein functions, and pHMM is not fast enough to handle the large number of protein sequences from numerous genome projects. Therefore, a more accurate and faster protein function prediction method is needed.
Year: 2018 PMID: 29949966 PMCID: PMC6022622 DOI: 10.1093/bioinformatics/bty275
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Overview of the DeepFam model. It is a feedforward convolutional neural network whose last layer represents the probability of each family. The convolution layer and the 1-max pooling layer calculate a score (activation) for the existence of a conserved region. The next layer is a fully connected layer, which can detect longer or more complex sites. To infer the probability of each family, the last layer is designed as a softmax layer (multinomial logistic regression), which is generally used for multi-class classification
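The layer sequence described in the Fig. 1 caption can be illustrated with a small forward pass. This is only a sketch under stated assumptions, not the DeepFam implementation: the ReLU activations, the kernel widths, the layer sizes, the random weights and the example sequence are all hypothetical, and no training is performed.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # 21 symbols, as in the notation table


def one_hot(seq):
    """Encode a sequence as a list of 21-dimensional one-hot rows."""
    return [[1.0 if a == ch else 0.0 for a in ALPHABET] for ch in seq]


def conv_1max(x, kernel):
    """Slide one convolution unit over x and keep its largest activation.

    Starting from 0.0 acts as a ReLU floor (an assumption of this sketch).
    """
    width = len(kernel)
    best = 0.0
    for start in range(len(x) - width + 1):
        score = sum(
            kernel[i][j] * x[start + i][j]
            for i in range(width)
            for j in range(len(ALPHABET))
        )
        best = max(best, score)
    return best


def dense(v, weights, biases):
    """Fully connected layer: one weighted sum per output unit."""
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weights, biases)]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(seq, kernels, w_hidden, b_hidden, w_out, b_out):
    """Convolution + 1-max pooling, then a hidden layer, then softmax."""
    x = one_hot(seq)
    pooled = [conv_1max(x, k) for k in kernels]
    hidden = [max(0.0, h) for h in dense(pooled, w_hidden, b_hidden)]
    return softmax(dense(hidden, w_out, b_out))


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Hypothetical toy model: 4 convolution units, 6 hidden units, 3 families.
rng = random.Random(0)
kernels = [random_matrix(width, len(ALPHABET), rng) for width in (3, 5, 5, 8)]
w_hidden = random_matrix(6, len(kernels), rng)
w_out = random_matrix(3, 6, rng)
probs = forward("MKTAYIAKQRQISFVKSHFSRQ", kernels, w_hidden, [0.0] * 6, w_out, [0.0] * 3)
print(probs)  # one probability per family
```

Because each convolution unit contributes only its maximum activation, the pooled vector has one entry per unit regardless of sequence length, which is what lets a fixed-size fully connected layer follow it.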
Notations of variables
| Maximum length among all sequences | |
| {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y, X} | |
| Size of | |
| Number of sequences | |
| Number of families | |
| Set of families | |
| Number of convolution units | |
| Number of hidden units on fully connected layer | |
| Length of |
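
To make the notation concrete, the following sketch turns a batch of sequences into fixed-size one-hot inputs over the 21-letter alphabet listed above. It is illustrative only: padding shorter sequences with all-zero rows and mapping any symbol outside the alphabet to `X` are assumptions of this sketch, and `encode_batch` is a hypothetical helper name.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # the 21 symbols from the notation table
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode_batch(sequences):
    """Return a nested list of shape (number of sequences, max length, 21)."""
    max_len = max(len(s) for s in sequences)
    batch = []
    for seq in sequences:
        rows = []
        for ch in seq.upper():
            row = [0] * len(ALPHABET)
            # Assumption: residues outside the alphabet are treated as X.
            row[INDEX.get(ch, INDEX["X"])] = 1
            rows.append(row)
        # Assumption: shorter sequences are padded with all-zero rows.
        rows.extend([0] * len(ALPHABET) for _ in range(max_len - len(seq)))
        batch.append(rows)
    return batch


batch = encode_batch(["MKT", "ACDEFB"])
print(len(batch), len(batch[0]), len(batch[0][0]))
```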
Fig. 2. The error rates of DeepFam models with different hyperparameter combinations. The lower the error rate, the better the hyperparameter setting
Prediction accuracy (%) comparison of COG dataset
| Dataset | COG-500-1074 | COG-250-1796 | COG-100-2892 |
|---|---|---|---|
| 91.40 | |||
| pHMM | 91.75 | 91.78 | |
| 3-mer LR | 85.59 | 81.15 | 75.44 |
| Protvec LR | 47.34 | 41.76 | 37.05 |
Bold indicates the best performance for each dataset.
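
For readers unfamiliar with k-mer features, the sketch below counts overlapping 3-mers in a sequence and lays them out as a fixed-length vector. It assumes that the "3-mer LR" row refers to a classifier fitted on 3-mer count features; the exact feature construction used for that baseline is not stated here, and the function names and the 20-letter alphabet are hypothetical choices for illustration.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumption: 20 standard residues only


def kmer_counts(seq, k=3):
    """Count every overlapping k-mer in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_vector(seq, k=3):
    """Lay the counts out in a fixed order, one entry per possible k-mer."""
    counts = kmer_counts(seq, k)
    return [counts["".join(p)] for p in product(AMINO_ACIDS, repeat=k)]


vec = kmer_vector("MKTMKTA")
print(len(vec), sum(vec))
```

With 20 residues and k = 3 the vector has 20^3 = 8000 entries, most of them zero for any single protein.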
Prediction accuracy (%) comparison of GPCR dataset in each level
| Method | Family | Sub-family | Sub-subfamily |
|---|---|---|---|
| pHMM | 95.77 | 85.39 | 78.50 |
| 3-mer LR | 95.59 | 83.51 | 77.06 |
| Protvec LR | 88.58 | 74.98 | 67.32 |
| Selective top-down* | 95.87 | 80.77 | 69.98 |
| Naive Bayes* | 77.29 | 52.60 | 36.66 |
| Bayesian network* | 85.24 | 64.27 | 50.69 |
| SMO* | 80.21 | 56.67 | 35.96 |
| Nearest Neighbour* | 95.87 | 78.68 | 69.40 |
| PART* | 93.27 | 78.73 | 65.68 |
| J48* | 92.93 | 77.49 | 64.30 |
| Naive Bayesian Tree* | 93.07 | 76.92 | 64.78 |
| AIRS2* | 91.98 | 74.58 | 62.68 |
| Conjunctive Rules* | 76.19 | 49.93 | 16.49 |
Note: Results marked with * are extended from Davies et al.
Bold indicates the best performance for each dataset.
Fig. 3. The average elapsed time over five trials to predict the families of 1000 protein sequences for each method. Three independent models were built for each method, trained to predict one of 1074, 1796 and 2892 families using the COG-500-1074, COG-250-1796 and COG-100-2892 datasets, respectively
Fig. 4. Visualization of MEME motifs, convolution units responsive to COG0517 sequences, PS51371 motifs and convolution units responsive to PS51371 sequences. (a) depicts the highest-ranked MEME motif and the corresponding logos, and (b) depicts the second-highest-ranked MEME motif and the corresponding logos. Only selected conserved regions are shown for the PS51371 motifs, because the raw PS51371 logo is generated from a multiple sequence alignment and is therefore too long, with too many unconserved regions