Sheng Wang, Shunyan Weng, Jianzhu Ma, Qingming Tang.
Abstract
Intrinsically disordered proteins or protein regions are involved in key biological processes, including regulation of transcription, signal transduction, and alternative splicing. Accurately predicting order/disorder regions ab initio from the protein sequence is a prerequisite for further analysis of the functions and mechanisms of these disordered regions. This work presents a learning method, weighted DeepCNF (Deep Convolutional Neural Fields), that improves the accuracy of order/disorder prediction in two ways: it exploits long-range sequential information together with the interdependency between adjacent order/disorder labels, and it assigns a different weight to each label during training and prediction to address the label imbalance issue. Evaluated on the CASP9 and CASP10 targets, our method obtains AUC values of 0.855 and 0.898, respectively, which are higher than those of the state-of-the-art single ab initio predictors.
Keywords: conditional neural field; deep convolutional neural network; deep learning; intrinsically disordered proteins; machine learning; prediction of disordered regions
Year: 2015 PMID: 26230689 PMCID: PMC4581195 DOI: 10.3390/ijms160817315
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
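Table 1. Overall properties of the non-terminal disordered regions with different lengths on the four datasets used in this work.
| Datasets | Length of Disordered Regions: 1–5 | Length of Disordered Regions: 6–15 | Length of Disordered Regions: 16–25 | Length of Disordered Regions: >25 | Number of Fragments of Disordered Regions: 1–5 | Number of Fragments of Disordered Regions: 6–15 | Number of Fragments of Disordered Regions: 16–25 | Number of Fragments of Disordered Regions: >25 |
|---|---|---|---|---|---|---|---|---|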
| Disorder723 | 964 | 2083 | 883 | 852 | 492 | 226 | 45 | 19 |
| UniProt90 | 12,804 | 37,420 | 16,646 | 22,655 | 4133 | 4093 | 852 | 514 |
| CASP9 | 272 | 494 | 215 | 119 | 118 | 52 | 11 | 3 |
| CASP10 | 163 | 261 | 113 | 55 | 73 | 31 | 6 | 2 |
Table 2. AUC values of models with different numbers of hidden layers on the 10 cross-validation batches of Disorder723. All other model parameters are fixed at their default values. The best value is shown in bold (the same convention is used in Table 3, Table 4, Table 5, Table 6 and Table 7).
| Number of Hidden Layers | Batch 1 | Batch 2 | Batch 3 | Batch 4 | Batch 5 | Batch 6 | Batch 7 | Batch 8 | Batch 9 | Batch 10 | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.878 | 0.925 | 0.842 | 0.853 | 0.882 | 0.835 | 0.871 | 0.868 | 0.898 | 0.842 | 0.869 |
| 2 | | | | | | | | | | | |
| 3 | 0.887 | 0.936 | 0.863 | 0.873 | 0.902 | 0.852 | 0.887 | 0.908 | 0.923 | 0.857 | 0.889 |
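The AUC reported in these tables is the area under the ROC curve. It can be read as the probability that a randomly chosen disordered residue receives a higher score than a randomly chosen ordered one. The following is a minimal sketch of that pairwise definition; the function name `auc` and the toy scores are illustrative assumptions, not the evaluation code used in the paper.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: predicted disorder score per residue (higher = more disordered).
    labels: 1 for disordered, 0 for ordered.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one residue of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (made-up scores): 2 disordered and 3 ordered residues.
toy_scores = [0.9, 0.8, 0.3, 0.6, 0.2]
toy_labels = [1, 0, 1, 0, 0]
print(auc(toy_scores, toy_labels))  # 4 of the 6 positive/negative pairs are ranked correctly
```

The quadratic loop is chosen for clarity; a rank-based formulation gives the same value and scales better to full datasets.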
Table 3. AUC values for different weight ratios between the order and disorder states on the 10 cross-validation batches of Disorder723.
| Weight Ratio | Batch 1 | Batch 2 | Batch 3 | Batch 4 | Batch 5 | Batch 6 | Batch 7 | Batch 8 | Batch 9 | Batch 10 | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5:5 | 0.884 | 0.932 | 0.857 | 0.871 | 0.902 | 0.842 | 0.884 | 0.895 | 0.921 | 0.850 | 0.884 |
| 2:8 | 0.892 | 0.938 | 0.863 | 0.882 | 0.911 | 0.853 | 0.891 | 0.902 | 0.928 | 0.857 | 0.892 |
| 1:9 | 0.899 | 0.943 | 0.869 | 0.915 | 0.858 | 0.897 | 0.933 | 0.862 | 0.897 | ||
| 0.7:9.3 | 0.947 | 0.886 | 0.917 | 0.909 | 0.939 | ||||||
| 0.5:9.5 | 0.901 | 0.872 | 0.884 | 0.857 | 0.902 | 0.903 | 0.864 | 0.899 | |||
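The weight ratios above control how much each residue contributes to the training objective depending on its true label. As a rough illustration only, the sketch below applies label weights to a plain per-residue log-loss; it is not the CRF likelihood that DeepCNF optimizes. The function name `weighted_log_loss`, its parameters, and the reading of a ratio such as 1:9 as (order weight, disorder weight) are assumptions made for this example.

```python
import math


def weighted_log_loss(probs_disorder, labels, w_order=1.0, w_disorder=9.0):
    """Weighted mean negative log-likelihood over residues.

    probs_disorder: predicted P(disorder) per residue, strictly between 0 and 1.
    labels: 1 for disordered, 0 for ordered.
    w_order, w_disorder: hypothetical label weights (e.g. 1 and 9 for a 1:9 ratio).
    """
    total = 0.0
    weight_sum = 0.0
    for p, y in zip(probs_disorder, labels):
        w = w_disorder if y == 1 else w_order
        p_true = p if y == 1 else 1.0 - p  # probability assigned to the true label
        total += -w * math.log(p_true)
        weight_sum += w
    return total / weight_sum


# Toy sequence: nine ordered residues predicted well, one disordered residue predicted poorly.
toy_probs = [0.1] * 9 + [0.2]
toy_labels = [0] * 9 + [1]
print(weighted_log_loss(toy_probs, toy_labels, 5, 5))  # equal weights
print(weighted_log_loss(toy_probs, toy_labels, 1, 9))  # the missed disordered residue dominates
```

With equal weights the single misclassified disordered residue is diluted by the nine ordered ones; up-weighting the rarer label makes the same mistake far more costly, which is the intuition behind tuning the ratio.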
Table 4. Contribution of different combinations of feature classes to the AUC value on the 10 cross-validation batches of Disorder723.
| Feature Class | Batch 1 | Batch 2 | Batch 3 | Batch 4 | Batch 5 | Batch 6 | Batch 7 | Batch 8 | Batch 9 | Batch 10 | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Amino acid | 0.843 | 0.877 | 0.793 | 0.832 | 0.845 | 0.784 | 0.852 | 0.834 | 0.865 | 0.803 | 0.833 |
| Structural | 0.864 | 0.904 | 0.830 | 0.841 | 0.858 | 0.826 | 0.863 | 0.858 | 0.882 | 0.819 | 0.855 |
| Evolution | 0.874 | 0.920 | 0.836 | 0.857 | 0.879 | 0.832 | 0.876 | 0.880 | 0.908 | 0.834 | 0.870 |
| Amino acid + Evolution | 0.883 | 0.928 | 0.845 | 0.866 | 0.887 | 0.843 | 0.884 | 0.885 | 0.917 | 0.847 | 0.879 |
| Structural + Evolution | 0.895 | 0.935 | 0.868 | 0.873 | 0.901 | 0.850 | 0.897 | 0.896 | 0.924 | 0.859 | 0.889 |
| All features | |||||||||||
Table 5. AUC values of several methods on the 10 cross-validation batches of Disorder723.
| Methods | Batch 1 | Batch 2 | Batch 3 | Batch 4 | Batch 5 | Batch 6 | Batch 7 | Batch 8 | Batch 9 | Batch 10 | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Iupred (long) | 0.747 | 0.764 | 0.645 | 0.758 | 0.727 | 0.702 | 0.732 | 0.694 | 0.747 | 0.689 | 0.721 |
| Iupred (short) | 0.821 | 0.857 | 0.756 | 0.826 | 0.823 | 0.752 | 0.840 | 0.787 | 0.839 | 0.795 | 0.810 |
| SPINE-D | 0.885 | 0.929 | 0.885 | 0.888 | 0.897 | 0.848 | 0.877 | 0.906 | 0.914 | 0.838 | 0.887 |
| DisoPred3 | 0.894 | 0.932 | 0.910 | 0.846 | 0.879 | 0.896 | 0.917 | 0.840 | 0.893 | ||
| DeepCNF-D | 0.875 | 0.886 | |||||||||
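The cross-validation tables report one AUC per held-out batch plus the mean over the 10 batches. The sketch below shows one way such a protocol could be organized. How the paper actually assigned proteins to batches is not stated in this section, so the seeded shuffle, the helper names `make_batches` and `mean`, and the placeholder protein identifiers are all assumptions. The final lines only recompute the Mean entry of the SPINE-D row from its ten per-batch values.

```python
import random


def make_batches(protein_ids, k=10, seed=0):
    """Split protein identifiers into k disjoint batches (hypothetical scheme)."""
    ids = list(protein_ids)
    random.Random(seed).shuffle(ids)  # fixed seed so the split is reproducible
    return [ids[i::k] for i in range(k)]


def mean(values):
    return sum(values) / len(values)


# Placeholder identifiers; a real run would use the dataset's own protein IDs.
proteins = [f"protein_{i}" for i in range(25)]
batches = make_batches(proteins)

for held_out, test_set in enumerate(batches):
    train_set = [p for j, b in enumerate(batches) if j != held_out for p in b]
    # A model would be trained on train_set and scored (AUC) on test_set here.
    assert not set(test_set) & set(train_set)

# Per-batch AUC values of the SPINE-D row in the table above.
spine_d = [0.885, 0.929, 0.885, 0.888, 0.897, 0.848, 0.877, 0.906, 0.914, 0.838]
print(round(mean(spine_d), 3))  # 0.887, matching the Mean column
```

Splitting at the protein level, rather than at the residue level, keeps all residues of one chain in the same batch.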
Table 6. Performance of several predictors on CASP9. We report the average precision, balanced accuracy (Bacc), Matthews correlation coefficient (MCC), and area under the ROC curve (AUC).
| Predictor | Precision | Bacc | MCC | AUC |
|---|---|---|---|---|
| Iupred (long) | 0.238 | 0.546 | 0.118 | 0.567 |
| Iupred (short) | 0.433 | 0.698 | 0.342 | 0.657 |
| SPINE-D | 0.382 | 0.391 | 0.832 | |
| DisoPred3 | 0.704 | 0.464 | 0.842 | |
| DeepCNF-D | 0.598 | 0.752 | ||
| DeepCNF-D (ami_only) | 0.549 | 0.707 | 0.400 | 0.700 |
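Precision, balanced accuracy and MCC are all functions of the per-residue confusion counts, with disorder treated as the positive class. The sketch below uses the standard textbook definitions; the function name `binary_metrics` and the toy counts are illustrative assumptions and are not taken from the CASP evaluation scripts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Precision, balanced accuracy and MCC from confusion counts.

    tp/fn: disordered residues predicted as disordered/ordered.
    tn/fp: ordered residues predicted as ordered/disordered.
    Undefined ratios (zero denominators) are reported as 0.0.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    bacc = 0.5 * (sensitivity + specificity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "bacc": bacc, "mcc": mcc}


# Toy counts (made up): 100 disordered and 900 ordered residues.
print(binary_metrics(tp=60, fp=40, tn=860, fn=40))
```

Because ordered residues greatly outnumber disordered ones, plain accuracy would be dominated by the ordered class; balanced accuracy and MCC are less sensitive to that imbalance.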
Table 7. Performance of several predictors on CASP10. We report the average precision, balanced accuracy (Bacc), Matthews correlation coefficient (MCC), and area under the ROC curve (AUC).
| Predictor | Precision | Bacc | MCC | AUC |
|---|---|---|---|---|
| Iupred (long) | 0.231 | 0.575 | 0.145 | 0.621 |
| Iupred (short) | 0.413 | 0.729 | 0.374 | 0.712 |
| SPINE-D | 0.307 | 0.366 | 0.876 | |
| DisoPred3 | 0.719 | 0.467 | 0.883 | |
| DeepCNF-D | 0.529 | 0.764 | ||
| DeepCNF-D (ami_only) | 0.504 | 0.737 | 0.433 | 0.772 |
Figure 1. The architecture of DeepCNF, where $i$ is the residue index, $X_i$ the associated input features, $H^k$ represents the $k$th hidden layer, and $Y_i$ is the output label. All the layers from the 1st to the top layer form a deep convolutional neural network (DCNN). The top layer and the label layer form a conditional random field (CRF). $W^k$, $U$ and $T$ are the model parameters, where $T$ is used to model correlation among adjacent residues.
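To illustrate how a CRF layer with a term for adjacent labels changes the output, the sketch below runs Viterbi decoding over two labels (0 = order, 1 = disorder). It assumes per-residue label scores (`unary`, standing in for what the top hidden layer would provide) and a pairwise score between neighbouring labels (`trans`). Viterbi is one standard CRF decoding choice and is used here only as an example; the decoding used in the paper may differ, and all names and numbers are hypothetical.

```python
def viterbi(unary, trans):
    """Highest-scoring label sequence for a linear-chain CRF.

    unary[i][y]: score of label y at residue i.
    trans[a][b]: score of label a being followed by label b.
    """
    n, k = len(unary), len(unary[0])
    best = [list(unary[0])]  # best[i][y]: best score of any path ending in y at i
    back = []                # back[i-1][y]: previous label on that best path
    for i in range(1, n):
        scores, ptrs = [], []
        for b in range(k):
            cand = [best[-1][a] + trans[a][b] for a in range(k)]
            a_star = max(range(k), key=lambda a: cand[a])
            scores.append(cand[a_star] + unary[i][b])
            ptrs.append(a_star)
        best.append(scores)
        back.append(ptrs)
    y = max(range(k), key=lambda b: best[-1][b])
    path = [y]
    for ptrs in reversed(back):  # follow back-pointers from the last residue
        y = ptrs[y]
        path.append(y)
    return path[::-1]


# Toy scores: the middle residue weakly prefers "disorder" on its own.
toy_unary = [[2.0, 0.0], [0.4, 0.6], [2.0, 0.0]]
no_coupling = [[0.0, 0.0], [0.0, 0.0]]
sticky = [[1.0, -1.0], [-1.0, 1.0]]  # rewards keeping the same label
print(viterbi(toy_unary, no_coupling))  # [0, 1, 0]
print(viterbi(toy_unary, sticky))       # [0, 0, 0]
```

With no coupling each residue is labelled independently; once neighbouring labels are rewarded for agreeing, the isolated weak "disorder" call is smoothed away.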
Figure 2. The feed-forward connection between two adjacent layers in the deep convolutional neural network.
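A feed-forward connection of this kind can be sketched as a one-dimensional convolution along the sequence: each neuron at a position in the upper layer is computed from a window of positions in the layer below, using weights shared across positions. The code below is a minimal pure-Python illustration. The tanh activation, the zero padding at the sequence ends, the omission of bias terms, and the names `conv_layer`, `weights` and `half_window` are assumptions made for this example rather than details confirmed by the text above.

```python
import math


def conv_layer(prev, weights, half_window):
    """Compute one hidden layer from the layer below.

    prev: list over positions; prev[j] is the feature vector at position j.
    weights: weights[m][o][n] links input feature n at window offset index o
             (o = offset + half_window) to output neuron m.
    half_window: number of neighbours used on each side of a position.
    """
    length = len(prev)
    n_in = len(prev[0])
    out = []
    for i in range(length):
        row = []
        for w_m in weights:
            z = 0.0
            for off in range(-half_window, half_window + 1):
                j = i + off
                if 0 <= j < length:  # positions outside the sequence contribute zero
                    for n in range(n_in):
                        z += w_m[off + half_window][n] * prev[j][n]
            row.append(math.tanh(z))
        out.append(row)
    return out


# Toy input: 3 positions, 1 feature each; 1 output neuron with a window of 3.
toy_prev = [[1.0], [2.0], [3.0]]
toy_weights = [[[0.1], [0.2], [0.3]]]
print(conv_layer(toy_prev, toy_weights, half_window=1))
```

Stacking several such layers widens the span of sequence positions that can influence a single output, which is how a deep convolutional network can capture longer-range sequential information.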