| Literature DB >> 29287069 |
Yu-Hui Qu, Hua Yu, Xiu-Jun Gong, Jia-Hui Xu, Hong-Shun Lee.
Abstract
DNA-binding proteins play pivotal roles in alternative splicing, RNA editing, methylation and many other biological functions in both eukaryotic and prokaryotic proteomes. Predicting the functions of these proteins from primary amino acid sequences is becoming one of the major challenges in the functional annotation of genomes. Traditional prediction methods often devote themselves to extracting physicochemical features from sequences while ignoring motif information and the positional relationships between motifs. Meanwhile, small data volumes and large noise in training data lower the accuracy and reliability of predictions. In this paper, we propose a deep learning-based method to identify DNA-binding proteins from primary sequences alone. It uses two stages of convolutional neural networks to detect the functional domains of protein sequences, a long short-term memory neural network to identify their long-term dependencies, and binary cross entropy to evaluate the quality of the neural networks. When the proposed method is tested on a realistic DNA-binding protein dataset, it achieves a prediction accuracy of 94.2% at a Matthews correlation coefficient of 0.961. Compared with LibSVM on the Arabidopsis and yeast datasets via independent tests, the accuracy rises by 9% and 4%, respectively. Comparative experiments using different feature extraction methods show that our model attains accuracy similar to the best of the others, while its sensitivity, specificity and AUC increase by 27.83%, 1.31% and 16.21%, respectively. These results suggest that our method is a promising tool for identifying DNA-binding proteins.
Year: 2017 PMID: 29287069 PMCID: PMC5747425 DOI: 10.1371/journal.pone.0188129
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
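The abstract notes that binary cross entropy is used to evaluate the networks. For reference, the standard binary cross-entropy loss over N training proteins with labels y_i ∈ {0, 1} and predicted binding probabilities p_i is the textbook form below (the paper's own notation is not reproduced here):

$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\right]$$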
Equal data set.
| Data set | DNA-binding | non-DNA-binding | Total |
|---|---|---|---|
| Original set | 42,257 | 42,310 | 84,567 |
| Train set | 33,805 | 33,848 | 67,653 |
| Test set | 8,452 | 8,462 | 16,914 |
| Yeast | 100 | 100 | 200 |
| Arabidopsis | 100 | 100 | 200 |
Realistic data set.
| Data set | DNA-binding | non-DNA-binding | Total |
|---|---|---|---|
| Original set | 42,257 | 341,481 | 383,738 |
| Train set | 33,805 | 273,185 | 306,990 |
| Test set | 8,452 | 68,296 | 76,748 |
| Validation set | 104 | 480 | 584 |
Multi-species data set.
| Species | Data set | DNA-binding | non-DNA-binding | Total |
|---|---|---|---|---|
| Human | Original set | 6,932 | 6,932 | 13,864 |
| | Train set | 5,546 | 5,546 | 11,092 |
| | Test set | 1,386 | 1,386 | 2,772 |
| Mouse | Original set | 4,883 | 4,883 | 9,766 |
| | Train set | 3,907 | 3,907 | 7,814 |
| | Test set | 976 | 976 | 1,952 |
| Rice | Original set | 4,501 | 4,501 | 9,002 |
| | Train set | 3,601 | 3,601 | 7,202 |
| | Test set | 900 | 900 | 1,800 |
Low-redundancy versions of the equal and realistic datasets.
| Data set | DNA-binding | non-DNA-binding | Total |
|---|---|---|---|
| Equal dataset | 17,327 | 26,443 | 43,770 |
| Realistic dataset | 17,327 | 125,792 | 143,119 |
Fig 1. Architecture of the deep learning model.
The amino acid encoder (a Python sketch of this mapping follows the table).
| Amino acids | Letters | Code |
|---|---|---|
| Alanine | A | 1 |
| Cysteine | C | 2 |
| Aspartic acid | D | 3 |
| Glutamic acid | E | 4 |
| Phenylalanine | F | 5 |
| Glycine | G | 6 |
| Histidine | H | 7 |
| Isoleucine | I | 8 |
| Lysine | K | 9 |
| Leucine | L | 10 |
| Methionine | M | 11 |
| Asparagine | N | 12 |
| Proline | P | 13 |
| Glutamine | Q | 14 |
| Arginine | R | 15 |
| Serine | S | 16 |
| Threonine | T | 17 |
| Valine | V | 18 |
| Tryptophan | W | 19 |
| Tyrosine | Y | 20 |
| Illegal amino acids | B, J, O, U, X, Z | 0 |
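A minimal Python sketch of this integer encoding, assuming sequences are truncated or right-padded with 0 to the fixed input length of 1000 used in the layer table below; the function name and the padding convention are illustrative assumptions rather than details taken from the paper:

```python
# Integer codes from the encoder table: A=1, C=2, ..., Y=20; non-standard residues -> 0.
AA_CODES = {aa: i + 1 for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}

def encode_sequence(seq, max_len=1000):
    """Map a protein sequence to integer codes; B, J, O, U, X, Z (and any other symbol) map to 0."""
    codes = [AA_CODES.get(aa, 0) for aa in seq.upper()[:max_len]]
    return codes + [0] * (max_len - len(codes))  # right-pad to the model's fixed input length
```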
Fig 2. The structure of the convolutional neural network.
The model uses 2 filters to obtain 2 feature maps, then applies a max-over-time pooling operation over each feature map and takes the maximum value as the feature corresponding to that filter.
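As a toy illustration of that pooling step (not the authors' code; the activation values are invented), each filter's feature map collapses to its single largest value:

```python
import numpy as np

# Two feature maps (one per filter), each with one activation per sequence position.
feature_maps = np.array([
    [0.1, 0.7, 0.3, 0.2],   # filter 1
    [0.5, 0.2, 0.9, 0.4],   # filter 2
])
pooled = feature_maps.max(axis=1)   # max-over-time: one feature per filter -> [0.7, 0.9]
```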
Fig 3. Long short-term memory cell.
The parameters and output sizes of each layer (a code sketch of the full stack follows the table).
| Layer | Parameters | Output size |
|---|---|---|
| Input | sentence_length = 1000 | (128, 1000) |
| | n_batches = 128 | |
| Embedding layer | input_dim = 21 | (128, 1000, 128) |
| | output_dim = 128 | |
| Convolution layer 1 | filters = 64 | (128, 991, 64) |
| | filter_length = 10 | |
| | activation = relu | |
| MaxPooling | pooling_length = 2 | (128, 496, 64) |
| Convolution layer 2 | filters = 64 | (128, 492, 64) |
| | filter_length = 5 | |
| | activation = relu | |
| MaxPooling | pooling_length = 2 | (128, 246, 64) |
| LSTM layer | lstm_output_size = 70 | (128, 70) |
| Output | activation = sigmoid | (128, 1) |
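Read together, the rows above describe an embedding layer, two convolution/pooling stages, an LSTM and a sigmoid output. The Keras-style sketch below reproduces that stack with the table's hyper-parameters; it is an illustration only, not the authors' released code, and the optimizer choice is an assumption:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, LSTM, Dense

model = Sequential([
    # 21 integer codes (20 amino acids plus 0 for illegal residues/padding) -> 128-d vectors
    Embedding(input_dim=21, output_dim=128, input_length=1000),
    # first convolution stage: 64 filters of length 10
    Conv1D(filters=64, kernel_size=10, activation="relu"),
    MaxPooling1D(pool_size=2),
    # second convolution stage: 64 filters of length 5
    Conv1D(filters=64, kernel_size=5, activation="relu"),
    MaxPooling1D(pool_size=2),
    # LSTM captures long-range dependencies between the detected motifs
    LSTM(70),
    # single sigmoid unit: probability of being a DNA-binding protein
    Dense(1, activation="sigmoid"),
])
# binary cross entropy matches the loss named in the abstract; "adam" is an assumed optimizer
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```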
Fig 4. Results of 5-fold cross-validation.
The prediction accuracies across different models.
| Model | Test data set | Accuracy |
|---|---|---|
| LibSVM | Arabidopsis (200) | 0.81 |
| | Yeast (200) | 0.76 |
| DNA Binder (10) | Arabidopsis (200) | 0.74 |
| | Yeast (200) | 0.67 |
| Our model | Arabidopsis (200) | |
| | Yeast (200) | |
The results on the realistic data set (metric definitions follow the table).
| Test data | Accuracy | Sensitivity | Specificity | AUC |
|---|---|---|---|---|
| Test data (76,748) | 0.942 | 0.884 | 0.916 | 0.961 |
| Validation data (584) | 0.825 | 0.873 | 0.712 | 0.851 |
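The accuracy, sensitivity and specificity columns follow the usual confusion-matrix definitions; for reference (standard formulas, with TP/TN/FP/FN the true/false positives and negatives):

$$\mathrm{Sensitivity}=\frac{TP}{TP+FN},\qquad \mathrm{Specificity}=\frac{TN}{TN+FP},\qquad \mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}$$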
Fig 5. The ROC curve of the test set.
Fig 6. The ROC curve of the validation set.
The results on the multi-species data set.
| Train set | Test set | Accuracy |
|---|---|---|
| Human (11,092) | Human (2,772) | 0.8294 |
| | Mouse (1,952) | 0.839 |
| | Rice (1,800) | 0.739 |
| Mouse (7,814) | Human (2,772) | 0.798 |
| | Mouse (1,952) | 0.7473 |
| | Rice (1,800) | 0.7479 |
| Rice (7,202) | Human (2,772) | 0.75 |
| | Mouse (1,952) | 0.719 |
| | Rice (1,800) | 0.918 |
Performance comparisons on the equal dataset.
| Method | Classifier | Accuracy |
|---|---|---|
| 188D | LR | 0.7607 |
| | SVM | 0.9078 |
| | RF | 0.8776 |
| AC | LR | 0.6789 |
| | SVM | 0.7993 |
| | RF | 0.8360 |
| CT | LR | 0.7080 |
| | SVM | 0.8565 |
| | RF | 0.8588 |
Performance comparisons on the realistic dataset.
| Method | Classifier | Accuracy | Sensitivity | Specificity | AUC |
|---|---|---|---|---|---|
| 188D | LR | 0.8940 | 0.1442 | 0.5560 | 0.5651 |
| | SVM | 0.9500 | 0.6057 | 0.9029 | 0.7989 |
| | RF | 0.9581 | 0.6213 | 0.9801 | 0.8099 |
| AC | LR | 0.8922 | 0.0220 | 0.5690 | 0.5100 |
| | SVM | 0.9255 | 0.3296 | 0.9495 | 0.6637 |
| | RF | 0.8919 | 0.0308 | 0.5949 | 0.5141 |
| CT | LR | 0.8925 | 0.0898 | 0.5745 | 0.5408 |
| | SVM | 0.9219 | 0.3230 | 0.8969 | 0.6592 |
| | RF | 0.8920 | 0.0290 | 0.6055 | 0.5133 |
The results on the low-redundancy equal data set.
| Method | Accuracy |
|---|---|
| full model | 0.928 |
| low-redundancy model | 0.8849 |
| 188D+SVM | 0.8745 |
The results on the low-redundancy realistic data set.
| Method | Accuracy | Sensitivity | Specificity | AUC |
|---|---|---|---|---|
| full model | 0.942 | 0.884 | 0.916 | 0.961 |
| low-redundancy model | 0.8638 | 0.5138 | 0.8461 | 0.7129 |
| 188D+SVM | 0.8435 | 0.3340 | 0.7063 | 0.6502 |
Fig 7. Loss comparisons across different models.
Fig 8. Accuracy (ACC) comparisons across different models.