Satish M Srinivasan, Suleyman Vural, Brian R King, Chittibabu Guda.
Abstract
BACKGROUND: In protein sequence classification, identification of the sequence motifs or n-grams that can precisely discriminate between classes is a more interesting scientific question than the classification itself. A number of classification methods aim at accurate classification but fail to explain which sequence features indeed contribute to the accuracy. We hypothesize that sequences in lower denominations (n-grams) can be used to explore the sequence landscape and to identify class-specific motifs that discriminate between classes during classification. Discriminative n-grams are short peptide sequences that are highly frequent in one class but are either minimally present or absent in other classes. In this study, we present a new substitution-based scoring function for identifying discriminative n-grams that are highly specific to a class.
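To make the definition of a discriminative n-gram concrete, the following is a minimal sketch and not the substitution-based scoring function of the study. It assumes a simple presence/absence rule: an n-gram is kept for a class if it occurs in at least `min_fraction` of that class's sequences and in none of the other classes' sequences. The function names, the `min_fraction` parameter, and the toy sequences are all hypothetical.

```python
from collections import Counter


def ngrams(sequence, n):
    """Return all overlapping substrings of length n in a sequence."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def class_ngram_counts(sequences_by_class, n):
    """For each class, count how many sequences contain each n-gram at least once."""
    counts = {}
    for label, seqs in sequences_by_class.items():
        counter = Counter()
        for seq in seqs:
            # set() so a repeated n-gram counts once per sequence
            counter.update(set(ngrams(seq, n)))
        counts[label] = counter
    return counts


def discriminative_ngrams(sequences_by_class, n, min_fraction=0.5):
    """Illustrative filter (an assumption, not the paper's score):
    keep n-grams frequent in one class and absent from every other class."""
    counts = class_ngram_counts(sequences_by_class, n)
    result = {}
    for label, counter in counts.items():
        total = len(sequences_by_class[label])
        others = [c for other, c in counts.items() if other != label]
        result[label] = sorted(
            gram for gram, k in counter.items()
            if k / total >= min_fraction and all(gram not in c for c in others)
        )
    return result


# Toy, made-up sequences used only to exercise the functions.
toy = {
    "NUC": ["MKRKRAA", "GGKRKRL"],
    "MIT": ["MLSLRQS", "MLSLAAG"],
}
hits = discriminative_ngrams(toy, n=4, min_fraction=1.0)
print(hits)
```

With these toy inputs, each class retains the single 4-gram shared by both of its sequences and absent from the other class. A real scoring function would also need to handle n-grams that are "minimally present" in other classes, which this presence/absence rule does not.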
Year: 2013 PMID: 23496846 PMCID: PMC3610217 DOI: 10.1186/1471-2105-14-96
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Location-wise distribution of full-length and Pfam-mapped protein sequences
| Subcellular localization (SL) | Abbreviation | Full-length sequences | Pfam-mapped sequences |
|---|---|---|---|
| Cytoskeleton | CSK | 259 | 200 |
| Cytoplasm | CYT | 3334 | 2809 |
| Endoplasmic Reticulum | END | 1016 | 884 |
| Extracellular | EXC | 8666 | 6393 |
| Golgi apparatus | GOL | 291 | 248 |
| Lysosome | LYS | 159 | 138 |
| Mitochondria | MIT | 2760 | 2383 |
| Nuclear | NUC | 5104 | 4221 |
| Plasma membrane | PLA | 6852 | 6155 |
| Peroxisome | POX | 212 | 190 |
SL- Subcellular Localization; Pfam- Protein family database.
Figure 1. Number of n-grams generated before and after substitution as a function of n-gram length at a selection threshold of 5.
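Specificity and sensitivity for different selection thresholds across different locations
| Location | Specificity | | | | | | Sensitivity | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|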
| CYT | 0.974 | 0.98 | 0.979 | 0.983 | 0.985 | 0.985 | 0.865 | 0.895 | 0.888 | 0.907 | 0.921 | 0.912 |
| CSK | 0.997 | 0.998 | 0.997 | 0.998 | 0.998 | 0.998 | 0.73 | 0.771 | 0.769 | 0.801 | 0.826 | 0.822 |
| END | 0.985 | 0.988 | 0.987 | 0.989 | 0.99 | 0.99 | 0.93 | 0.95 | 0.951 | 0.962 | 0.97 | 0.97 |
| EXC | 0.973 | 0.979 | 0.979 | 0.982 | 0.985 | 0.985 | 0.899 | 0.924 | 0.923 | 0.939 | 0.949 | 0.95 |
| GOL | 0.996 | 0.997 | 0.996 | 0.997 | 0.997 | 0.997 | 0.983 | 0.996 | 0.994 | 0.998 | 0.999 | 0.998 |
| LYS | 0.998 | 0.998 | 0.998 | 0.999 | 0.999 | 0.999 | 0.987 | 0.993 | 0.987 | 0.993 | 0.997 | 0.988 |
| MIT | 0.982 | 0.985 | 0.984 | 0.987 | 0.988 | 0.988 | 0.877 | 0.898 | 0.879 | 0.895 | 0.906 | 0.900 |
| NUC | 0.977 | 0.981 | 0.98 | 0.984 | 0.986 | 0.986 | 0.749 | 0.807 | 0.806 | 0.844 | 0.871 | 0.871 |
| PLA | 0.961 | 0.97 | 0.968 | 0.974 | 0.978 | 0.977 | 0.878 | 0.897 | 0.893 | 0.908 | 0.919 | 0.919 |
| POX | 0.997 | 0.998 | 0.998 | 0.998 | 0.998 | 0.998 | 0.965 | 0.979 | 0.983 | 0.99 | 0.997 | 0.997 |
CYT – Cytoplasm; CSK – Cytoskeleton; END – Endoplasmic Reticulum; EXC – Extracellular/Secreted; GOL – Golgi; LYS – Lysosome; MIT – Mitochondria; NUC – Nuclear; PLA – Plasma membrane; POX – Peroxisome.
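For reference, sensitivity and specificity can be computed from confusion counts using their standard definitions, as in the sketch below. How the study counts true and false positive n-grams is not described in this record, so the counts used here are made-up placeholders and the function names are hypothetical.

```python
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)


# Placeholder counts for illustration only (not taken from the study).
sens = sensitivity(tp=90, fn=10)
spec = specificity(tn=950, fp=50)
print(round(sens, 3), round(spec, 3))
```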
Figure 2. ROC curve showing the performance of the scoring function in predicting true positive and false positive n-grams. CYT – Cytoplasm; CSK – Cytoskeleton; END – Endoplasmic Reticulum; EXC – Extracellular/Secreted; GOL – Golgi; LYS – Lysosome; MIT – Mitochondria; NUC – Nuclear; PLA – Plasma membrane; POX – Peroxisome.
Figure 3. ROC curve comparing the performance of SF1, Wordspy and SF2. The black oval ring indicates ROC values of SF1 and the orange oval ring indicates ROC values of Wordspy. The black circle indicates the ROC region for SF1; the blue circle indicates the ROC region for Wordspy; and the blue crosses and orange dots indicate the ROC region for SF2. CSK – Cytoskeleton; GOL – Golgi; LYS – Lysosome; POX – Peroxisome.
Figure 4. Protein sequences with mapped discriminative n-grams and non-discriminative regions masked with ‘X’.
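The masking shown in Figure 4 can be sketched as follows. This is an illustrative assumption about the procedure, not the authors' code: every residue covered by at least one occurrence of a discriminative n-gram is kept, and all other positions are replaced with 'X'. The function name and the example sequence are hypothetical.

```python
def mask_non_discriminative(sequence, grams, mask_char="X"):
    """Keep residues covered by any listed n-gram; mask everything else."""
    keep = [False] * len(sequence)
    for gram in grams:
        if not gram:
            continue  # skip empty strings, which would match everywhere
        start = sequence.find(gram)
        while start != -1:
            for i in range(start, start + len(gram)):
                keep[i] = True
            # advance by one position so overlapping occurrences are found
            start = sequence.find(gram, start + 1)
    return "".join(ch if k else mask_char for ch, k in zip(sequence, keep))


# Made-up sequence and n-gram, for illustration only.
masked = mask_non_discriminative("MKRKRAAGG", ["KRKR"])
print(masked)  # XKRKRXXXX
```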
Figure 5. Average coverage across 50 PROSITE families for different selection thresholds. The blue line indicates the average coverage across 50 PROSITE families at different selection thresholds.
Figure 6. A schematic diagram showing the methodology and scoring function. GOL – Golgi; MIT – Mitochondria; NUC – Nuclear.