| Literature DB >> 17683567 |
Carson Andorf, Drena Dobbs, Vasant Honavar.
Abstract
BACKGROUND: Incorrectly annotated sequence data are becoming more commonplace as databases increasingly rely on automated techniques for annotation. Hence, there is an urgent need for computational methods for checking consistency of such annotations against independent sources of evidence and detecting potential annotation errors. We show how a machine learning approach designed to automatically predict a protein's Gene Ontology (GO) functional class can be employed to identify potential gene annotation errors.
Year: 2007 PMID: 17683567 PMCID: PMC1994202 DOI: 10.1186/1471-2105-8-284
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Performance of classifiers trained on human versus mouse kinases in predicting AmiGO annotations. The performance measures accuracy, kappa coefficient, correlation coefficient, precision, and recall are reported for two of the HDTree classifiers. The first classifier is trained on 330 human kinases. The performance is based on 10-fold cross-validation. The second classifier is trained on the 330 human kinases and tested on 244 mouse kinases. The annotations for the mouse and human kinases were obtained from AmiGO.
| Classifier | Accuracy (%) | Kappa | CC (class 1) | CC (class 2) | CC (class 3) | Precision (class 1) | Precision (class 2) | Precision (class 3) | Recall (class 1) | Recall (class 2) | Recall (class 3) |
| Trained on 330 human kinases (10-fold CV) | 89.1 | 0.76 | 0.82 | 0.86 | 0.30 | 0.97 | 1.00 | 0.15 | 0.95 | 0.74 | 0.71 |
| Trained on 330 human kinases, tested on 244 mouse kinases | 15.1 | -0.40 | -0.40 | -0.43 | -0.01 | 0.17 | 0.11 | 0.25 | 0.41 | 0.07 | 0.01 |
Figure 1. Distribution of Ser/Thr, Tyr, and dual specificity kinases among annotated protein kinases in human versus mouse genomes [see Additional file 9]. Pie charts illustrate the functional family distribution of protein kinases in human (top) versus mouse (bottom), based on: a. AmiGO functional classifications: Ser/Thr (GO:0004674) [Blue]; Tyr (GO:0004713) [Red] or "dual specificity" (proteins with both GO classifications) [Yellow]. b. UniProt annotations: classification based on UniProt records containing the key words Ser/Thr [Blue], Tyr [Red], or dual specificity [Yellow] [see Additional file 2]. c. Predicted annotations by the HDTree classifier: The classifier was built on human proteins with functional labels Ser/Thr (GO:0004674) [Blue], Tyr (GO:0004713) [Red] or "dual specificity" [Yellow] derived from AmiGO and verified by UniProt [see Additional file 4].
Figure 2. Comparison of UniProt annotations of mouse protein kinase sequences with annotations from AmiGO or predicted by HDTree. The bar charts illustrate the number of proteins that were in agreement (blue)/disagreement (red) with the annotations found in UniProt. Proteins that belong to each of the three functional classes found in the UniProt records are represented by two bars. The blue bar represents the number of proteins in which UniProt and the given method share the same annotation (agreement) for that function. The red bar represents the number of proteins in which UniProt and the given method have different annotations (disagreement) for that function. a. AmiGO vs. UniProt annotations b. HDTree predictions vs. UniProt annotations [see Additional files 3 and 4].
Comparison of AmiGO and UniProt annotations for 211 mouse protein kinases with RCA Evidence code. Each of the 211 mouse kinase proteins with an RCA evidence code used in this study has both an AmiGO and a UniProt annotation. This table shows the number of proteins that have each of the nine possible combinations of AmiGO and UniProt annotations. Each row of the table represents one of the three possible UniProt labels and each column represents each of the three AmiGO annotations. Each entry of the table shows the number of proteins with the corresponding annotation. Note that all entries along the diagonal (in bold) show the number of proteins for which the AmiGO and UniProt annotations were in agreement. All other entries show the number of proteins where AmiGO and UniProt were in disagreement [see Additional files 2 and 3].
| UniProt label \ AmiGO label | (class 1) | (class 2) | (class 3) |
| (class 1) | 105 | 35 | |
| (class 2) | 54 | 3 | |
| (class 3) | 0 | 4 | |
Comparison of performance of classifiers based on AmiGO annotations and UniProt annotations. The performance measures accuracy, kappa coefficient, correlation coefficient, precision, and recall are reported for two of the HDTree classifiers. Both classifiers were trained on 330 human kinases and tested on 211 mouse kinases with RCA evidence codes in AmiGO. The first classifier was trained and tested with annotations provided by UniProt and the second classifier used annotations obtained from AmiGO.
| Classifier | Accuracy (%) | Kappa | CC (class 1) | CC (class 2) | CC (class 3) | Precision (class 1) | Precision (class 2) | Precision (class 3) | Recall (class 1) | Recall (class 2) | Recall (class 3) |
| Trained and tested with UniProt annotations | 97.1 | 0.93 | 0.98 | 0.94 | 0.00 | 0.97 | 0.97 | 0.00 | 0.99 | 1.00 | 0.00 |
| Trained and tested with AmiGO annotations | 4.2 | -0.37 | -0.64 | -0.85 | 0.00 | 0.06 | 0.00 | 0.00 | 0.14 | 0.00 | 0.00 |
Classification schema for Classier #3 (Method for predicting dual specificity kinases). HDTree Classifier #3 uses the outputs from HDTree Classifier #1 and HDTree Classifier #2 to distinguish between dual-specificity kinases, Ser/Thr kinases, and Tyr kinases. There are four possible labelings from the binary classifiers #1 and #2. 'Yes' or 'No' votes from Classifier #1 correspond to predictions of Ser/Thr or Tyr labels, respectively, for the protein. 'Yes' or 'No' votes from Classifier #2 correspond to predictions of Tyr or Ser/Thr labels. When both classifiers predict the protein to be Ser/Thr (that is, Classifier #1 votes 'Yes' and Classifier #2 votes 'No'), Classifier #3 labels the protein as "exclusively Ser/Thr" (and hence, not Tyr). Similarly, when both classifiers predict the protein to be Tyr, Classifier #3 labels the protein as "exclusively Tyr" (and hence not Ser/Thr). When both classifiers vote 'Yes' or when both vote 'No,' Classifier #3 labels the protein as having "Dual" catalytic activity. See Methods section for details on each classifier.
| Prediction of Classifier #1 (Ser/Thr) | Prediction of Classifier #2 (Tyr) | Label assigned by Classifier #3 |
| Yes | Yes | Dual |
| Yes | No | Exclusively Ser/Thr |
| No | Yes | Exclusively Tyr |
| No | No | Dual |
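The voting schema described above reduces to a simple decision rule. The following is a minimal Python sketch of that rule (the function name and vote encoding are my own, not taken from the paper):

```python
def classifier3_label(clf1_vote: str, clf2_vote: str) -> str:
    """Combine the votes of binary Classifiers #1 and #2 into the final
    label, following the schema described for Classifier #3.

    clf1_vote: 'Yes' (Ser/Thr) or 'No' (Tyr) from Classifier #1.
    clf2_vote: 'Yes' (Tyr) or 'No' (Ser/Thr) from Classifier #2.
    """
    if clf1_vote == "Yes" and clf2_vote == "No":
        return "Exclusively Ser/Thr"  # both classifiers agree on Ser/Thr
    if clf1_vote == "No" and clf2_vote == "Yes":
        return "Exclusively Tyr"      # both classifiers agree on Tyr
    return "Dual"                     # conflicting votes imply dual specificity
```

Note that the two "agreement" cases are the off-diagonal vote combinations, because a 'Yes' from Classifier #1 and a 'No' from Classifier #2 both point at Ser/Thr.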
Performance measure definitions for binary classification. The performance measures accuracy, precision, recall, correlation coefficient, and kappa coefficient are used to evaluate the performance of our machine learning approaches [45]. Accuracy is the fraction of overall predictions that are correct. Precision is the ratio of true positives to the total number of examples predicted as positive. Recall is the ratio of true positives to the total number of actual positive examples. Correlation coefficient measures the correlation between predictions and actual class labels. Kappa coefficient is used as a measure of agreement between two random variables (predictions and actual class labels). The table summarizes the definitions of performance measures in the 2-class setting (binary classification), where M = the total number of classes and N = the total number of examples. TP, TN, FP, FN are the true positives, true negatives, false positives, and false negatives for the given confusion matrix.
| Performance Measure | Definition |
| Accuracy | (TP + TN) / N |
| Precision | TP / (TP + FP) |
| Recall | TP / (TP + FN) |
| Correlation Coefficient | (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)) |
| Kappa Coefficient | (P_o - P_e) / (1 - P_e), with observed agreement P_o = (TP + TN) / N and chance agreement P_e = ((TP + FP)(TP + FN) + (FN + TN)(FP + TN)) / N^2 |
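These standard confusion-matrix formulas can be sketched in a few lines of Python (a hypothetical helper under the conventional TP/TN/FP/FN definitions, not code from the paper):

```python
import math

def binary_measures(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the five binary-classification performance measures from
    confusion matrix counts."""
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    # Precision: fraction of predicted positives that are truly positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of actual positives that were recovered
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Matthews correlation coefficient between predictions and labels
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    # Cohen's kappa: observed agreement corrected for chance agreement
    p_o = accuracy
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (p_o - p_e) / (1 - p_e) if p_e != 1 else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "cc": cc, "kappa": kappa}
```

For example, a classifier with TP = 40, TN = 50, FP = 10, FN = 0 gets accuracy 0.9, precision 0.8, recall 1.0, and kappa 0.8, showing how kappa discounts the agreement expected by chance.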
Performance measure definitions for multi-class classification. The performance measures accuracy, precision, recall, correlation coefficient, and kappa coefficient are used to evaluate the performance of our machine learning approaches [45]. Accuracy is the fraction of overall predictions that are correct. Precision is the ratio of true positives to the total number of examples predicted as positive. Recall is the ratio of true positives to the total number of actual positive examples. Correlation coefficient measures the correlation between predictions and actual class labels. Kappa coefficient is used as a measure of agreement between two random variables (predictions and actual class labels). The table displays the general definition of each measure, where M = the total number of classes, N = the total number of examples, and x_ik represents the number of examples in row i and column k of the given confusion matrix.
| Performance Measure | Definition |
| Accuracy (class i) | (TP_i + TN_i) / N |
| Precision (class i) | TP_i / (TP_i + FP_i) |
| Recall (class i) | TP_i / (TP_i + FN_i) |
| Correlation Coefficient (class i) | (TP_i × TN_i - FP_i × FN_i) / sqrt((TP_i + FP_i)(TP_i + FN_i)(TN_i + FP_i)(TN_i + FN_i)) |
| Kappa Coefficient | (P_o - P_e) / (1 - P_e), with P_o = (Σ_i x_ii) / N and P_e = Σ_i (Σ_k x_ik)(Σ_k x_ki) / N^2 |
Here, for class i, TP_i = x_ii, FN_i = Σ_{k≠i} x_ik, FP_i = Σ_{k≠i} x_ki, and TN_i = N - TP_i - FP_i - FN_i.
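The multi-class measures follow the same pattern, computed directly from the M × M confusion matrix x_ik. A Python sketch (my own illustrative helper; it assumes rows index actual classes and columns index predicted classes, an orientation the record does not state):

```python
def multiclass_measures(x):
    """Per-class precision/recall plus overall accuracy and kappa from an
    M x M confusion matrix x, where x[i][k] is the number of examples in
    row i and column k. Rows are taken as actual classes and columns as
    predicted classes (an assumed orientation)."""
    m = len(x)
    n = sum(sum(row) for row in x)
    row_tot = [sum(x[i]) for i in range(m)]                       # actual-class totals
    col_tot = [sum(x[i][k] for i in range(m)) for k in range(m)]  # predicted-class totals
    # Overall accuracy: diagonal (correct predictions) over all examples
    accuracy = sum(x[i][i] for i in range(m)) / n
    # Per-class precision and recall from the diagonal and the margins
    precision = [x[k][k] / col_tot[k] if col_tot[k] else 0.0 for k in range(m)]
    recall = [x[i][i] / row_tot[i] if row_tot[i] else 0.0 for i in range(m)]
    # Kappa: observed agreement corrected for chance agreement P_e
    p_e = sum(row_tot[i] * col_tot[i] for i in range(m)) / (n * n)
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0
    return accuracy, precision, recall, kappa
```

For instance, the 3 × 3 matrix [[4, 1, 0], [0, 5, 0], [1, 0, 4]] (13 of 15 examples on the diagonal) yields accuracy 13/15 and kappa 0.8.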