Shumin Li, Junjie Chen, Bin Liu.
Abstract
BACKGROUND: Protein remote homology detection plays a vital role in studies of protein structure and function. Almost all traditional machine learning methods require fixed-length features to represent protein sequences. However, extracting discriminative features is difficult when knowledge of the proteins is limited. Deep learning techniques, on the other hand, have demonstrated their advantage in automatically learning representations. It is therefore worthwhile to explore the application of deep learning techniques to protein remote homology detection.
Keywords: Bidirectional Long Short-Term Memory; Neural network; Protein remote homology detection; Protein sequence analysis
Year: 2017 PMID: 29017445 PMCID: PMC5634958 DOI: 10.1186/s12859-017-1842-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The structure of ProDec-BLSTM. The input layer converts the pseudo proteins into feature vectors by one-hot encoding. Next, the subsequences within the sliding window are fed into the bidirectional LSTM layer to extract sequence patterns. Then, the time-distributed dense layer weights the extracted patterns. Finally, the extracted feature vectors are fed into the output layer for prediction
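The first two steps in the caption, one-hot encoding followed by a sliding window, can be illustrated with a minimal sketch. It assumes the input is a string over the 20 standard amino acids; the alphabet order, the function names, and the window size used below are illustrative assumptions, not values taken from the paper.

```python
# Assumed alphabet: the 20 standard amino acids in an arbitrary fixed order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Map each residue to a 20-dimensional indicator vector."""
    encoded = []
    for residue in sequence:
        vector = [0] * len(AMINO_ACIDS)
        # Residues outside the assumed alphabet are left as all zeros.
        if residue in INDEX:
            vector[INDEX[residue]] = 1
        encoded.append(vector)
    return encoded


def sliding_windows(encoded, window_size):
    """Return every contiguous run of `window_size` encoded positions."""
    last_start = len(encoded) - window_size
    return [encoded[i:i + window_size] for i in range(last_start + 1)]


# Hypothetical usage: a 4-residue toy sequence and a window of 3.
windows = sliding_windows(one_hot_encode("ACDE"), 3)
```

Each window is a list of one-hot vectors that a recurrent layer could consume one position at a time. A sequence shorter than the window yields no windows in this sketch.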
Fig. 2 The structure of the LSTM memory cell. Three gates, the input gate (marked as i), the forget gate (marked as f), and the output gate (marked as o), control the information flowing into and out of the block. σ denotes the sigmoid function, which produces a value between 0 and 1. The internal cell state is maintained and updated through the coordination of the input gate and the forget gate. The output gate controls the output of the information stored in the cell. h is the output of the memory cell, x is the matrix representing the input subsequence, and t denotes the t-th time step
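For reference, the gate interactions described in the caption correspond to the commonly used LSTM formulation below. This is the textbook form, given as an assumption; the paper's exact parameterization may differ. The weight matrices W and U, the biases b, the candidate state c̃, and the element-wise product ⊙ are notation introduced here, while i, f, o, h, x, t, and σ follow the caption.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```

The fifth line is the cell-state update: the forget gate scales the previous state and the input gate scales the new candidate. The last line is the output gate releasing the stored state as h.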
Mean ROC and ROC50 scores of various methods on the SCOP benchmark dataset (Eq. 1)
| Method | Mean ROC | Mean ROC50 | Classifier |
|---|---|---|---|
| GPkernel | 0.902 | 0.591 | SVM |
| GPextended | 0.869 | 0.542 | SVM |
| GPboost | 0.797 | 0.375 | SVM |
| SVM-Pairwise | 0.849 | 0.555 | SVM |
| Mismatch | 0.878 | 0.543 | SVM |
| eMOTIF | 0.857 | 0.551 | SVM |
| LA-kernel | 0.919 | 0.686 | SVM |
| PSI-BLAST | 0.575 | 0.175 | NA |
| LSTM | 0.943 | 0.735 | LSTM |
| ProDec-BLSTM | 0.969 | 0.849 | LSTM |
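The two metrics in the table can be sketched as follows, assuming the usual definitions: ROC is the area under the full ROC curve, and ROC50 is the area under the curve up to the first 50 false positives, normalized to lie between 0 and 1. The function name `roc_n` and the toy data are illustrative assumptions, and tied scores are not given any special treatment in this sketch.

```python
def roc_n(scores, labels, n=None):
    """Normalized area under the ROC curve up to the first n false positives.

    labels use 1 for positive and 0 for negative samples.
    With n=None the full ROC area is returned.
    """
    # Rank samples from the highest to the lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(1 for _, label in ranked if label == 1)
    total_neg = len(ranked) - total_pos
    if total_pos == 0 or total_neg == 0:
        raise ValueError("need at least one positive and one negative sample")

    limit = total_neg if n is None else min(n, total_neg)
    true_pos_seen = 0
    false_pos_seen = 0
    area = 0
    for _, label in ranked:
        if label == 1:
            true_pos_seen += 1
        else:
            # Each false positive adds the positives ranked above it.
            false_pos_seen += 1
            area += true_pos_seen
            if false_pos_seen == limit:
                break
    return area / (limit * total_pos)


# Toy example with made-up scores, not data from the paper.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 0, 1, 0, 0]
full_roc = roc_n(toy_scores, toy_labels)
roc_1 = roc_n(toy_scores, toy_labels, n=1)
```

A mean score like those in the table would then be the average of this per-family value over all families in the benchmark, with `n=50` for ROC50.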
Fig. 3 Feature visualization of ProDec-BLSTM for the protein family b.1.1.1. The positive and negative samples are shown in red and blue, respectively
Mean ROC and ROC50 scores of related methods on the SCOPe independent dataset
| Method | Mean ROC | Mean ROC50 |
|---|---|---|
| HHblits [a] | 0.725 | 0.443 |
| Hmmer [b] | 0.556 | 0.145 |
| PSI-BLAST [c] | 0.668 | 0.096 |
| ProtDec-LTR [d] | 0.742 | 0.445 |
| ProDec-BLSTM | 0.970 | 0.714 |

[a] The command line of HHblits is ‘-e 1 -p 0 -E inf -Z 10000 -B 10000 -b 10000’
[b] The parameters of Hmmer are set to their defaults
[c] The parameters of PSI-BLAST are set to their defaults
[d] The above three alignment-based methods are combined by ProtDec-LTR. The model is trained with the SCOP benchmark dataset (Eq. 1)
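As an illustration of how the HHblits flags in footnote a might be passed to the tool, the sketch below assembles an argument list without running anything. The flag string is copied from the footnote. The file names, the database prefix, and the helper name `build_hhblits_command` are hypothetical. The `-i`, `-d`, and `-o` options are the standard HHblits input, database, and output options, but the source does not say how they were set.

```python
import shlex

# Flags quoted in footnote a of the SCOPe table.
HHBLITS_FLAGS = "-e 1 -p 0 -E inf -Z 10000 -B 10000 -b 10000"


def build_hhblits_command(query_path, database_prefix, output_path):
    """Assemble an argument list that could be handed to subprocess.run."""
    base = ["hhblits", "-i", query_path, "-d", database_prefix, "-o", output_path]
    return base + shlex.split(HHBLITS_FLAGS)


# Hypothetical paths, for illustration only.
command = build_hhblits_command("query.fasta", "scope_db", "query.hhr")
print(" ".join(command))
```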