Yideng Cai, Jiacheng Wang, Lei Deng.
Abstract
The assignment of function to proteins at a large scale is essential for understanding the molecular mechanisms of life. However, only a very small percentage of the more than 179 million proteins in UniProtKB have Gene Ontology (GO) annotations supported by experimental evidence. In this paper, we propose an integrated deep-learning-based classification model, named SDN2GO, to predict protein functions. SDN2GO applies convolutional neural networks to learn and extract features from sequences, protein domains, and known PPI networks, and then uses a weighted classifier to integrate these features and predict GO terms accurately. We constructed the training set and the independent test set according to the time-delayed principle of the Critical Assessment of Function Annotation (CAFA) and compared SDN2GO with two highly competitive methods and the classic BLAST method on the independent test set. The results show that our method outperforms the others on each sub-ontology of GO. We also investigated the contribution of protein domain information. We drew on natural language processing (NLP) techniques to process domain information and pre-trained a deep learning sub-model to extract comprehensive domain features. The experimental results demonstrate that these domain features substantially improve the performance of our model. Our deep learning models, together with the data pre-processing scripts, are publicly available as open-source software at https://github.com/Charrick/SDN2GO.
Keywords: convolutional neural network; deep learning; deep multi-label classification; protein function; word embedding
Year: 2020 PMID: 32411695 PMCID: PMC7201018 DOI: 10.3389/fbioe.2020.00391
Source DB: PubMed Journal: Front Bioeng Biotechnol ISSN: 2296-4185
Figure 1. The integrated deep learning model architecture. (1) The Sequence sub-model uses 1-dimensional convolutional neural networks to extract features from the sequence input, which is encoded as 3-grams and then mapped to a 3-gram vector matrix. (2) The PPI Net sub-model uses classical neural networks to condense the features from the PPI network. (3) The Domain sub-model initializes a Sparse layer that generates a lookup table for domains and is optimized as part of the sub-model; the sorted domain sentence processed by the Sparse layer is fed into 1-dimensional convolutional neural networks to extract features. (4) The output features of the three sub-models are combined and fed into the Weighted Classifier, whose output vector represents the probabilities of the GO terms.
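The 3-gram encoding step of the Sequence sub-model can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and is not taken from the SDN2GO code: the 20-letter alphabet, the use of overlapping 3-grams, the `TRIGRAM_INDEX` vocabulary, the `encode_3grams` helper, and the choice of id 0 for unknown 3-grams and padding. The resulting integer ids would then index rows of an embedding matrix to form the 3-gram vector matrix.

```python
from itertools import product

# Assumption: the 20 standard amino acids; the original model may use a different alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical vocabulary: each of the 8000 possible 3-grams gets an id >= 1.
# Id 0 is reserved here for unknown 3-grams and for padding.
TRIGRAM_INDEX = {
    "".join(gram): i + 1
    for i, gram in enumerate(product(AMINO_ACIDS, repeat=3))
}


def encode_3grams(sequence, max_len=None):
    """Map a protein sequence to the ids of its overlapping 3-grams.

    If max_len is given, the result is truncated or zero-padded to that length.
    """
    seq = sequence.upper()
    ids = [TRIGRAM_INDEX.get(seq[i:i + 3], 0) for i in range(len(seq) - 2)]
    if max_len is not None:
        ids = ids[:max_len] + [0] * max(0, max_len - len(ids))
    return ids
```

Padding to a fixed length is one common way to batch variable-length sequences for a 1-dimensional convolution; whether SDN2GO pads, truncates, or both is not stated in this record.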
Figure 2. The architecture of a single GO classifier in the weighted classifier.
The 5-fold cross-validation results on the training data.
| SN2GO (human) | 0.473 | 0.441 | 0.908 | 0.546 | 0.527 | 0.938 | 0.587 | 0.600 | 0.949 |
| SDN2GO (human) | | | | | | | | | |
| SN2GO (yeast) | 0.414 | 0.289 | 0.810 | 0.548 | 0.435 | 0.870 | 0.520 | 0.395 | |
| SDN2GO (yeast) | | | | | | | | | 0.878 |
The bold values indicate the best values.
Figure 3. Precision-recall (P-R) curves of SDN2GO and SN2GO. The performance of the two methods was evaluated on the human validation data in each sub-ontology of GO (Gene Ontology).
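For readers unfamiliar with how a P-R curve is traced for multi-label GO prediction, the following sketch computes one (precision, recall) point per score threshold. It is a simplified, micro-averaged illustration and not necessarily the evaluation protocol used in the paper; the function name `precision_recall_points` and the convention of reporting precision as 1.0 when nothing is predicted are assumptions.

```python
def precision_recall_points(y_true, y_score, thresholds):
    """Return a list of (threshold, precision, recall) tuples.

    y_true:  per-protein lists of 0/1 labels, one entry per GO term.
    y_score: per-protein lists of predicted probabilities, same shape.
    Counts are pooled over all protein/term pairs (micro-averaging).
    """
    points = []
    for t in thresholds:
        tp = fp = fn = 0
        for truths, scores in zip(y_true, y_score):
            for truth, score in zip(truths, scores):
                predicted = score >= t
                if predicted and truth:
                    tp += 1
                elif predicted:
                    fp += 1
                elif truth:
                    fn += 1
        # Assumption: precision is reported as 1.0 when no term is predicted.
        precision = tp / (tp + fp) if (tp + fp) else 1.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        points.append((t, precision, recall))
    return points
```

Sweeping the threshold from 0 to 1 and plotting recall against precision gives a curve of the kind shown in the figure; a method whose curve lies closer to the top-right corner performs better.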
Comparison with the competing methods on the independent test set.
| BLAST | 0.347 | 0.192 | 0.771 | 0.381 | 0.292 | 0.873 | 0.386 | 0.245 | 0.860 |
| DeepGO | 0.321 | 0.095 | 0.729 | 0.291 | 0.117 | 0.784 | 0.210 | 0.080 | 0.687 |
| NetGO | 0.173 | 0.048 | 0.594 | 0.386 | 0.243 | 0.919 | 0.217 | 0.092 | 0.669 |
| SN2GO | 0.132 | 0.044 | 0.893 | 0.423 | 0.306 | 0.953 | 0.384 | 0.264 | |
| SDN2GO | | | | | | | | | 0.947 |
The bold values indicate the best values.
Figure 4. Precision-recall (P-R) curves of BLAST, DeepGO, NetGO, SN2GO, and SDN2GO. The performance of the five methods was evaluated on the independent test set in MFO (molecular function ontology).