Alok Sharma, Abdollah Dehzangi, James Lyons, Seiya Imoto, Satoru Miyano, Kenta Nakai, Ashwini Patil.
Abstract
With the exponential increase in the number of sequenced organisms, automated annotation of proteins is becoming increasingly important. Intrinsically disordered regions are known to play a significant role in protein function. Despite their abundance, especially in eukaryotes, they are rarely used to inform function prediction systems. In this study, we extracted seven sequence features in intrinsically disordered regions and developed a scheme to use them to predict Gene Ontology Slim terms associated with proteins. We evaluated the function prediction performance of each feature. Our results indicate that the residue composition-based features have the highest precision, while bigram probabilities, based on sequence profiles of intrinsically disordered regions obtained from PSI-BLAST, have the highest recall. Amino acid bigrams and features based on secondary structure show an intermediate level of precision and recall. Almost all features showed a high prediction performance for GO Slim terms related to extracellular matrix, nucleus, RNA and DNA binding. However, feature performance varied significantly for different GO Slim terms, emphasizing the need for a unique classifier optimized for the prediction of each functional term. These findings provide a first comprehensive and quantitative evaluation of sequence features in intrinsically disordered regions and will help in the development of a more informative protein function predictor.
Year: 2014 PMID: 24587103 PMCID: PMC3933697 DOI: 10.1371/journal.pone.0089890
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Features extracted from sequences of intrinsically disordered regions.
| Feature | Reference | Dimensions |
| Chemical composition | Moesa et al., 2012 | 5 |
| Amino acid composition | Patil et al., 2012 | 20 |
| Composition+Dubchak features | Ding and Dubchak, 2001 | 125 |
| Occurrence+Dubchak features | Taguchi and Gromiha, 2007 | 125 |
| Sequence bigrams | Ghanty and Pal, 2009 | 400 |
| Alternate bigrams | Ghanty and Pal, 2009 | 400 |
| Profile bigrams | Sharma et al., 2013 | 400 |
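To make the dimensions in the table concrete, the following is a minimal sketch of two of the simpler features: amino acid composition (20 values) and sequence bigrams (400 values, one per ordered residue pair). It assumes both are plain normalized counts over the 20 standard residues; the exact definitions and normalization in the cited references may differ, and the function names are illustrative only.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aa_composition(seq):
    """Return a 20-dimensional vector of residue fractions.

    Assumption: non-standard symbols (e.g. X) are ignored.
    """
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    n = len(residues)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(a) / n for a in AMINO_ACIDS]

def sequence_bigrams(seq):
    """Return a 400-dimensional vector of adjacent residue-pair fractions.

    Assumption: a pair containing a non-standard symbol is skipped.
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    s = seq.upper()
    total = 0
    for i in range(len(s) - 1):
        pair = s[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(pairs)
    return [counts[p] / total for p in pairs]
```

Applied to the sequence of a single disordered region, each function yields one fixed-length feature vector regardless of the region's length, which is what allows regions of different sizes to be fed to the same classifier.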
Figure 1. Schematic diagram of the prediction of protein function using features of intrinsically disordered regions with m-pairwise classifiers.
Features F are extracted from proteins P that have IDRs R. Protein GO Slim terms c are assigned to IDRs. A single pairwise classifier is trained for each of the m GO Slim terms. The classifier is used to predict a GO Slim term for a protein using features of each IDR it contains.
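The scheme in the caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: each classifier is treated as a binary "term versus not term" decision, every IDR inherits all GO Slim terms of its parent protein, and a protein is assigned a term if any one of its IDRs is predicted positive. The caption does not specify that aggregation rule, and the data layout and function names here are hypothetical.

```python
def build_training_sets(proteins, terms):
    """Build one labeled training set per GO Slim term.

    proteins: list of dicts {"terms": set of GO ids,
                             "idrs": list of feature vectors}
    Each IDR inherits the labels of its parent protein.
    """
    training = {t: [] for t in terms}
    for protein in proteins:
        for features in protein["idrs"]:
            for t in terms:
                label = 1 if t in protein["terms"] else 0
                training[t].append((features, label))
    return training

def predict_protein_terms(classifiers, idr_features):
    """Predict GO Slim terms for one protein from its IDR features.

    classifiers: dict mapping term -> callable(features) -> bool
    Assumption: a term is assigned if any IDR is predicted positive.
    """
    predicted = set()
    for term, clf in classifiers.items():
        if any(clf(f) for f in idr_features):
            predicted.add(term)
    return predicted
```

Keeping the classifiers as plain callables leaves the choice of model open, so a Naïve Bayes classifier or any other binary model can be plugged in per term.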
Average performance of sequence features in IDRs using 10-fold cross-validation for all 130 GO Slim terms tested.
| Feature | Recall/Sensitivity | Specificity | Precision |
| Chemical composition | 1.81 | 99.47 | 11.32 |
| Amino acid composition | 6.07 | 97.54 | 10.02 |
| Composition+Dubchak features | 30.41 | 80.68 | 8.08 |
| Occurrence+Dubchak features | 33.29 | 77.66 | 7.90 |
| Alternate bigrams | 37.04 | 73.52 | 8.50 |
| Sequence bigrams | 38.77 | 73.21 | 8.61 |
| Profile bigrams | 50.39 | 59.22 | 7.13 |
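For reference, the three measures in the table follow the standard confusion-matrix definitions, sketched below. The functions return fractions (multiply by 100 for the percentages shown in the table), and the counts in the usage note are invented purely for illustration; they are not taken from the study.

```python
def recall(tp, fn):
    """Recall (sensitivity): fraction of true positives recovered."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def specificity(tn, fp):
    """Specificity: fraction of true negatives correctly rejected."""
    return tn / (tn + fp) if (tn + fp) else 0.0

def precision(tp, fp):
    """Precision: fraction of positive predictions that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0
```

With hypothetical counts of tp=5, fn=5, fp=45, tn=45, recall and specificity are both 0.5 while precision is only 0.1, which shows how precision can stay low when positives are rare even though the other two measures look moderate.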
Figure 2. Precision-recall plots comparing the performance of the sequence feature classifiers.
Precision-recall curves for the prediction of GO Slim terms by Naïve Bayes classifier using 7 sequence features of IDRs. Abbreviations used: AA – amino acid, Chem – chemical, Comp – composition, Occu – occurrence.
MCC values for 7 IDR sequence features for the top 30 GO Slim terms predicted by at least 4 features with MCC >0.05.
| GO Slim term | Description | Chemical composition | Amino acid composition | Composition+Dubchak | Occurrence+Dubchak | Sequence bigrams | Alternate bigrams | Profile bigrams |
| GO:0005578 | Proteinaceous extracellular matrix | 0.288 | 0.346 | 0.212 | 0.103 | 0.182 | 0.171 | 0.222 |
| GO:0030198 | Extracellular matrix organization | 0.302 | 0.210 | 0.121 | 0.131 | 0.220 | 0.204 | 0.121 |
| GO:0005576 | Extracellular region | 0.215 | 0.266 | 0.174 | 0.098 | 0.149 | 0.163 | 0.162 |
| GO:0006397 | mRNA processing | 0.178 | 0.189 | 0.139 | 0.146 | 0.176 | 0.152 | 0.044 |
| GO:0005198 | Structural molecule activity | 0.171 | 0.203 | 0.113 | 0.133 | 0.126 | 0.117 | 0.102 |
| GO:0001071 | Nucleic acid binding transcription factor activity | −0.005 | 0.149 | 0.148 | 0.133 | 0.190 | 0.175 | 0.138 |
| GO:0005634 | Nucleus | 0.093 | 0.108 | 0.131 | 0.140 | 0.144 | 0.134 | 0.088 |
| GO:0048856 | Anatomical structure development | 0.071 | 0.113 | 0.124 | 0.116 | 0.132 | 0.120 | 0.110 |
| GO:0003723 | RNA binding | 0.077 | 0.113 | 0.116 | 0.117 | 0.118 | 0.095 | 0.102 |
| GO:0003677 | DNA binding | 0.014 | 0.079 | 0.118 | 0.118 | 0.141 | 0.122 | 0.098 |
| GO:0005783 | Endoplasmic reticulum | 0.169 | 0.137 | 0.070 | 0.044 | 0.084 | 0.096 | 0.072 |
| GO:0034641 | Cellular nitrogen compound metabolic process | 0.096 | 0.094 | 0.079 | 0.079 | 0.114 | 0.107 | 0.056 |
| GO:0009790 | Embryo development | – | 0.067 | 0.119 | 0.104 | 0.114 | 0.118 | 0.099 |
| GO:0048646 | Anatomical structure formation involved in morphogenesis | 0.040 | 0.079 | 0.096 | 0.074 | 0.102 | 0.100 | 0.085 |
| GO:0030154 | Cell differentiation | 0.030 | 0.065 | 0.109 | 0.091 | 0.093 | 0.080 | 0.091 |
| GO:0051276 | Chromosome organization | 0.024 | 0.084 | 0.100 | 0.107 | 0.096 | 0.089 | 0.058 |
| GO:0042254 | Ribosome biogenesis | 0.034 | 0.067 | 0.113 | 0.106 | 0.084 | 0.079 | 0.060 |
| GO:0016887 | ATPase activity | 0.040 | 0.046 | 0.100 | 0.100 | 0.092 | 0.098 | 0.056 |
| GO:0005730 | Nucleolus | 0.038 | 0.064 | 0.104 | 0.096 | 0.074 | 0.068 | 0.071 |
| GO:0006259 | DNA metabolic process | 0.013 | 0.065 | 0.105 | 0.095 | 0.099 | 0.081 | 0.040 |
| GO:0034655 | Nucleobase-containing compound catabolic process | 0.057 | 0.062 | 0.098 | 0.095 | 0.071 | 0.083 | 0.029 |
| GO:0005654 | Nucleoplasm | 0.060 | 0.055 | 0.046 | 0.047 | 0.088 | 0.100 | 0.072 |
| GO:0005694 | Chromosome | 0.014 | 0.069 | 0.090 | 0.078 | 0.085 | 0.082 | 0.046 |
| GO:0051082 | Unfolded protein binding | 0.082 | 0.083 | 0.089 | 0.080 | 0.048 | 0.039 | 0.023 |
| GO:0022618 | Ribonucleoprotein complex assembly | 0.000 | 0.127 | 0.078 | 0.066 | 0.063 | 0.062 | 0.047 |
| GO:0005886 | Plasma membrane | – | 0.033 | 0.055 | 0.085 | 0.096 | 0.102 | 0.068 |
| GO:0004386 | Helicase activity | 0.006 | 0.039 | 0.094 | 0.097 | 0.084 | 0.078 | 0.033 |
| GO:0007165 | Signal transduction | – | 0.045 | 0.079 | 0.073 | 0.072 | 0.074 | 0.076 |
| GO:0007049 | Cell cycle | −0.008 | 0.071 | 0.075 | 0.073 | 0.076 | 0.078 | 0.049 |
| GO:0042393 | Histone binding | −0.007 | 0.080 | 0.082 | 0.059 | 0.071 | 0.068 | 0.033 |
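The MCC (Matthews correlation coefficient) values above range from −1 to 1, with 0 corresponding to chance-level prediction. A minimal sketch of the standard formula follows. Returning `None` when the denominator is zero is a choice made for this example; the source does not state how such cases were handled or what the "–" entries in the table denote.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns None when any marginal total is zero, since the
    coefficient is undefined in that case (an assumption of this sketch).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return None
    return (tp * tn - fp * fn) / denom
```

Unlike precision or recall taken alone, MCC uses all four cells of the confusion matrix, which makes it a reasonable single-number summary when positive and negative classes are very unequal in size.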