Antonis Koussounadis, Oliver C Redfern, David T Jones.
Abstract
BACKGROUND: The classification of protein domains in the CATH resource is primarily based on structural comparisons, sequence similarity and manual analysis. One of the main bottlenecks in the processing of new entries is the evaluation of 'borderline' cases by human curators with reference to the literature, and better tools for helping both expert and non-expert users quickly identify relevant functional information from text are urgently needed. A text-based method for protein classification is presented, which complements the existing sequence- and structure-based approaches, especially in cases that exhibit low similarity to existing members and require manual intervention. The method is based on the assumption that textual similarity between sets of documents relating to proteins reflects similarity in biological function and can be exploited to make classification decisions.
Year: 2009 PMID: 19416501 PMCID: PMC2688513 DOI: 10.1186/1471-2105-10-129
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Table 1. Performance of structure, text classifiers and logistic regression models in 'borderline' proteins from CATH
| Classifier | AUC (a) | MCC (a) | AUC (b) | Nagelkerke's R2 (b) | Model L.R. (b) | AUC (c) | MCC (c) |
| SSAP + TEXT | 0.920 | 0.29 | 0.924 | 0.34 | 70021 | 0.917 | 0.28 |
| SSAP | 0.908 | 0.23 | 0.913 | 0.30 | 62182 | 0.905 | 0.22 |
| TEXT | 0.789 | 0.12 | 0.791 | 0.06 | 11814 | 0.788 | 0.12 |
The performance of structural similarity, measured by the SSAP algorithm, and text similarity (TEXT) as classifiers for protein classification at the homologous superfamily level in the DC1.1993 dataset of 'borderline' cases in CATH, using textCATH as the reference set. Classification performance was assessed using the AUC and MCC measures on the whole (a), training (b) and test (c) sets. Nagelkerke's R2 is a measure of the variance accounted for by the variables of the logistic regression models. Model L.R. stands for the model likelihood chi-square, which is the difference between the null and residual deviance. Logistic regression models were trained on a random subset of comparisons from 1000 abstracts and tested on the remaining 993 abstracts from the 'borderline' cases dataset DC1.1993.
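The measures named in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' evaluation code: all function names are hypothetical, the AUC is computed by brute-force pairwise comparison (quadratic, so only suitable for small inputs), and the Nagelkerke formula assumes ungrouped binary outcomes, for which the saturated log-likelihood is zero.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def auc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Assumes at least one positive and one negative.
    """
    pos = [s for s, is_match in zip(scores, labels) if is_match]
    neg = [s for s, is_match in zip(scores, labels) if not is_match]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def model_lr(null_deviance, residual_deviance):
    """Model likelihood chi-square: null deviance minus residual deviance."""
    return null_deviance - residual_deviance

def nagelkerke_r2(null_deviance, residual_deviance, n):
    """Cox-Snell R2 rescaled by its maximum attainable value (n = sample size)."""
    cox_snell = 1.0 - math.exp(-(null_deviance - residual_deviance) / n)
    max_cox_snell = 1.0 - math.exp(-null_deviance / n)
    return cox_snell / max_cox_snell
```

For datasets of the size reported here, a rank-based AUC computation would be preferable to the pairwise loop.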
Table 2. Coefficients and Wald tests for logistic regression model
| Variable | Coeff | S.E. | Wald Z | p |
| Intercept | -23.4406 | 0.093 | -251.91 | < 0.001 |
| SSAP | 0.2891 | 0.001 | 211.59 | < 0.001 |
| TEXT | 0.1254 | 0.001 | 96.58 | < 0.001 |
Coefficients and Wald Z statistics of the logistic regression model SSAP+TEXT that includes SSAP (structural similarity) and TEXT (text similarity) independent variables. Coeff = coefficient for the logistic regression; S.E. = standard error; Wald Z, p = Wald statistic with corresponding probability.
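As a worked illustration of how the fitted model turns a pair of similarity scores into a probability, the sketch below plugs the coefficients from the table into the standard logistic function. The function name and the example inputs are hypothetical; only the three coefficients come from the table.

```python
import math

# Coefficients copied from the SSAP + TEXT logistic regression table.
INTERCEPT = -23.4406
COEF_SSAP = 0.2891
COEF_TEXT = 0.1254

def match_probability(ssap: float, text: float) -> float:
    """Modelled probability that a comparison is a superfamily match."""
    linear = INTERCEPT + COEF_SSAP * ssap + COEF_TEXT * text
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical inputs: the same SSAP score with and without text support.
without_text = match_probability(ssap=80.0, text=0.0)
with_text = match_probability(ssap=80.0, text=10.0)
```

Because both coefficients are positive, raising either score raises the modelled probability, so a modest text similarity can push a structurally ambiguous pair over a decision threshold.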
Figure 1. Test ROC curves of the text similarity algorithm in the 'borderline' cases dataset. ROC curves of the test set from the 'borderline' cases DC1.1993 dataset for the TEXT (green) and SSAP (black) classifiers and the logistic regression model that includes the SSAP and TEXT independent variables (blue). The reference set was textCATH. The inset shows the same curves for low error rates (FPR < 0.10).
Figure 2. Coverage versus error graphs. (i) Coverage (sensitivity) versus error graph. For each classifier, the scores of the comparisons between the query DC1.1993 'borderline' set and the reference textCATH set were sorted in decreasing order. The comparisons include both CATH superfamily classification matches (true positives, TP) and non-matches (false positives, FP). Descending from the top classifier score, the numbers of true and false positives are counted for each possible cutoff. Green: TEXT; black: SSAP; blue: SSAP + TEXT logistic regression model. (ii) Log of the fraction of true positives versus the log of the false positive rate (FPR) graph. The FPR is defined as the fraction of the total false positives for each score cutoff. The fraction of TP is the proportion of the total number of TPs (see text).
Table 3. Number of classification matches at various rates of false positives in the 'borderline' DC1.1993 dataset
| False positive rate | Number of errors | TEXT coverage (TP; % of total TP) | TEXT cutoff | SSAP coverage (TP; % of total TP) | SSAP cutoff | SSAP + TEXT coverage (TP; % of total TP) | SSAP + TEXT cutoff |
| 10^-5 | 31 | 8; 0.04 | 77.70 | 16; 0.09 | 79.94 | 98; 0.58 | 0.9808 |
| 10^-4 | 306 | 96; 0.57 | 48.86 | 229; 1.36 | 79.40 | 585; 3.48 | 0.6792 |
| 10^-3 | 3060 | 707; 4.21 | 20.75 | 1677; 10.00 | 76.66 | 2571; 15.33 | 0.2982 |
| 10^-2 | 30598 | 3036; 18.10 | 7.83 | 5808; 34.64 | 71.24 | 6901; 41.16 | 0.0706 |
Coverage is the number of true classification matches recovered at each cutoff, shown as a count and as a percentage of the total TP. Scores range between 1 and 100, 30 and 80, and 0 and 1 for the TEXT, SSAP and SSAP + TEXT classifiers, respectively. Total comparisons: 3076606, positive matches: 16765.
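The coverage-at-fixed-error procedure behind this table can be sketched as follows. This is a hedged reconstruction from the description, not the authors' code: the function names are hypothetical, tied scores are not treated specially, and the rounding of the error budget is an assumption.

```python
def allowed_errors(fpr, n_negatives):
    """Number of false positives tolerated at a given false positive rate."""
    return round(fpr * n_negatives)

def coverage_at_fpr(scores, labels, fpr):
    """Walk down the ranked scores and count matches within the error budget.

    Returns (true positives found, score of the last accepted comparison).
    """
    n_negatives = sum(1 for is_match in labels if not is_match)
    budget = allowed_errors(fpr, n_negatives)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    cutoff = None
    for score, is_match in ranked:
        if is_match:
            tp += 1
        else:
            if fp == budget:  # the next error would exceed the budget
                break
            fp += 1
        cutoff = score
    return tp, cutoff

# Toy data (hypothetical): five comparisons, three of them true matches.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [True, False, True, True, False]
strict = coverage_at_fpr(toy_scores, toy_labels, fpr=0.0)
relaxed = coverage_at_fpr(toy_scores, toy_labels, fpr=0.5)
```

With the totals given in the table note (3076606 comparisons, 16765 positive matches), rounding the rate times the number of negatives gives 31, 306, 3060 and 30598, the values in the "Number of errors" column.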
Table 4. REGS352 and PDB145 datasets
| Amidohydrolase | 87 | 26 | 73 | 11 | 41 |
| Crotonase | 50 | 16 | 36 | 7 | 14 |
| Enolase | 85 | 9 | 66 | 8 | 39 |
| Haloacid Dehalogenase | 104 | 19 | 93 | 10 | 30 |
| VOC | 95 | 17 | 84 | 7 | 21 |
| TOTAL | 421 | 87 | 352 | 43 | 145 |
Distribution of the gold dataset sequences and the derived datasets REGS352 and PDB145 among the five superfamilies of the gold dataset (Brown et al., 2006).
Table 5. Example sentences used in training of the SVM model
| Sentence | Class |
| Sequence analysis showed that | + |
| The packing of the octameric enzyme in this crystal form is unusual, because the asymmetric unit contains three subunits. | + |
| Cys-592, which is essential for enzymatic activity, is located in the above-mentioned histidine-rich region. | + |
| From the significant sequence similarity between | + |
| Two | + |
| A thermostable | - |
| A | - |
| The K+ ion activates the enzyme 100-fold with an activation constant of 6 mM, well below the physiologic concentration of K+ in E. coli. | - |
| A putative regulator and its possible recognition site was suggested on the basis of homology data. | - |
| The enzyme has a subunit Mr of 33,500 +/- 2000 by SDS/polyacrylamide-gel electrophoresis. | - |
Sample positive and negative sentences manually classified by an expert biologist for their content on functional, structural and classification information and used as training examples to learn an SVM model. Terms in italics were removed prior to training.
Table 6. Example sentences used in testing of the SVM model
| Sentence | SVM score |
| These homologous proteins, designated the "enolase superfamily", include enolase as well as more metabolically specialized enzymes: mandelate racemase, galactonate dehydratase, glucarate dehydratase, muconate-lactonizing enzymes, N-acylamino acid racemase, beta-methylaspartate ammonia-lyase, and o-succinylbenzoate synthase. | 3.99 |
| GlucD is a member of the mandelate racemase (MR) subgroup of the enolase superfamily, the members of which catalyze reactions that are initiated by abstraction of the alpha-proton of a carboxylate anion substrate. | 3.42 |
| The structure of Neurospora crassa 3-carboxy-cis, cis-muconate lactonizing enzyme, a beta propeller cycloisomerase. | 1.41 |
| The corresponding cDNA was amplified from a library of lobster muscle cDNA, and a sequence corresponding to residues 27–398 was determined. | -1.10 |
| The values for kcat were reduced 4.5 × 10(3)-fold for (R)-mandelate and 2.9 × 10(4)-fold for (S)-mandelate; the values for kcat/Km were reduced 3 × 10(4)-fold. | -3.31 |
Sample positive and negative test examples classified and scored by the SVM model.
Table 7. Classifier performance in the enzyme dataset
| Condition | Test set | Reference set | Analyser | AUC | MCC | F-measure |
| 1 | Abstract | Dp20 – DX33 -Ann | Stop | 0.75 | 0.51 | 0.56 |
| 2 | Annotations | Dp20 – DX33 -Ann | Stop | 0.77 | 0.53 | 0.58 |
| 3 | Abstract | Dp20 – Ann | Stop | 0.74 | 0.50 | 0.53 |
| 4 | Abstract | Dp20 – Ann | Standard | 0.70 | 0.33 | 0.44 |
| 5 | Abstract | Dp20 | Stop | 0.74 | 0.49 | 0.52 |
| 6 | Abstract | Dp1 | Stop | 0.64 | 0.31 | 0.40 |
Classifier performance was assessed using AUC, MCC and F-measure under six conditions of the reference set PDB145: inclusion of additional abstracts from related articles (Dp20); inclusion of annotations (Ann); filtering using the SVM model (DX33). Conditions 1 and 2: 20 SVM-filtered abstracts per enzyme, Stop analyser, inclusion of PDB/UniProt annotations. Condition 3: 20 abstracts per enzyme, Stop analyser, inclusion of PDB/UniProt annotations. Condition 4: 20 abstracts per enzyme, Standard analyser, inclusion of PDB/UniProt annotations. Condition 5: 20 abstracts per enzyme, Stop analyser. Condition 6: 1 abstract per enzyme, Stop analyser. For the superfamily classification task, all 352 enzymes of REGS352 were classified into 5 superfamilies. Abstracts were used in the test set, except in condition 2, where annotations from UniProt and PDB fields were used for comparison.
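For readers less familiar with the F-measure reported alongside AUC and MCC, a minimal sketch of its computation from confusion counts follows. The function name is hypothetical, and the source does not state how scores were aggregated across the five superfamilies, so this shows only the single-class (one superfamily versus the rest) case.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and balanced F-measure from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of precision and recall, written in terms of counts.
    f_measure = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f_measure

# Hypothetical counts for one superfamily treated as the positive class.
example = precision_recall_f(tp=6, fp=2, fn=6)
```

Unlike MCC, the F-measure ignores true negatives, which is one reason the two can rank the same conditions differently.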
Figure 3. ROC curves and precision-recall graphs of the text similarity algorithm in the enzyme dataset. (A) ROC curves and (B) Precision-Recall graphs of conditions 1–6 (Table 7) in the enzyme dataset. Condition 1, black solid line; Condition 2, black dashed line; Condition 3, red solid line; Condition 4, red dashed line; Condition 5, blue solid line; Condition 6, blue dashed line.