Irena I Artamonova, Goar Frishman, Dmitrij Frishman.
Abstract
BACKGROUND: Unsupervised annotation of proteins by software pipelines suffers from very high error rates. Spurious functional assignments are usually caused by unwarranted homology-based transfer of information from existing database entries to new target sequences. We have previously demonstrated that data mining in large sequence annotation databanks can help identify annotation items that are strongly associated with each other, and that exceptions to strong positive association rules often point to potential annotation errors. Here we investigate the applicability of negative association rule mining to revealing erroneously assigned annotation items.
Year: 2007 PMID: 17659089 PMCID: PMC1940032 DOI: 10.1186/1471-2105-8-261
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Annotation features used in this work
| Feature name | Description | Examples | Method used: algorithm | Method used: threshold value | Method used: reference | Number of proteins having items of this type | Total number of items found | Average number of items per protein | Total number of attribute values |
| Length | Protein length (number of amino acids) binned over four ranges | Small (<120), Medium (>=120, <1000), Large (>=1000, <1500), eXtraLarge (>=1500) | Direct calculation | Not applicable | None | All (55063) | All (55063) | 1 | 4 |
| GC content of the gene | The value of the GC content binned over 3 ranges | L (<=0.4), M (>0.4, <0.5), H (>=0.5) | Direct calculation | Not applicable | None | 30218* | 30218 | 1 | 3 |
| Isoelectric point | The value of the isoelectric point binned over 4 ranges | C (aCid, pI <=5.5), NC (Neutral-aCid, 5.0 < pI <=7.0), NL (Neutral-aLkaline, 7.0 < pI <=9.2), L (aLkaline, pI > 9.2) | Direct calculation | Not applicable | None | All (55063) | All (55063) | 1 | 4 |
| Low complexity regions | Percentage of residues predicted to be in low complexity regions binned over three ranges | High (>=10%), Medium (0–10%), None (0%) | SEG | Default SEG parameters | (Wootton, 1994) | All (55063) | All (55063) | 1 | 3 |
| Disordered regions | Percentage of residues in disordered regions binned over 4 ranges | High (>=20%), Medium (10–20%), Low (0–10%), 0 (0%) | DisEMBL | Default DisEMBL parameters | (Linding et al., 2003) | All (55063) | All (55063) | 1 | 4 |
| Coiled coil regions | Presence of coiled coil regions | COILS:+ | COILS | Default COILS parameters | (Lupas, 1997) | 7809 | 7809 | 1 | 1 |
| Structural class derived from secondary structure prediction | Classification of proteins based on the prevalent type of secondary structure | Alpha/beta | Predator | Default Predator parameters | (Frishman and Argos, 1997) | 52711 | 52711 | 1 | 4 |
| Transmembrane segments | Presence and number of transmembrane segments | TM (=transmembrane domains are present), 1 TMs, 12 TMs (the number of TM domains) | TMHMM | Default TMHMM parameters | (Krogh et al., 2001) | 12437 | 24874 | 2 | 52 |
| Signal peptide | The presence of the signal peptide | SignalP:+ | SignalP | Default SignalP parameters | (Bendtsen et al., 2004) | 8066 | 8066 | 1 | 1 |
| Protein localization | Predicted cellular localization | Secretory pathway | TargetP | Default TargetP parameters | (Emanuelsson et al., 2000) | 18186 | 18186 | 1 | 2 |
* For technical reasons GC content values were not available for Arabidopsis thaliana genes at the time of writing.
Features transferred by similarity (type 3, see Methods) are shown in italic.
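The binned features in the table can be illustrated with a minimal sketch. It assumes only the cut-offs listed in the Examples column for protein length and low complexity regions; the function names and the `Feature:Value` item labels are hypothetical and are not taken from the original work.

```python
def bin_length(n_residues: int) -> str:
    """Map a protein length (in amino acids) to the four ranges in the table."""
    if n_residues < 120:
        return "Small"
    if n_residues < 1000:
        return "Medium"
    if n_residues < 1500:
        return "Large"
    return "eXtraLarge"


def bin_low_complexity(percent: float) -> str:
    """Map the percentage of low complexity residues to the three ranges."""
    if percent >= 10:
        return "High"
    if percent > 0:
        return "Medium"
    return "None"


def annotate(n_residues: int, low_complexity_percent: float) -> set:
    """Build a set of annotation items (hypothetical 'Feature:Value' labels)."""
    return {
        f"Length:{bin_length(n_residues)}",
        f"LowComplexity:{bin_low_complexity(low_complexity_percent)}",
    }


if __name__ == "__main__":
    # A 350-residue protein with 4.2% low complexity residues.
    print(sorted(annotate(350, 4.2)))
```

Each protein then contributes exactly one item per binned feature, which matches the "Average number of items per protein" value of 1 reported for these rows.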
Figure 1. Distribution of negative association rule strength (the probability that a database entry satisfies the right-hand side of a rule given that it satisfies the left-hand side). The minimal coverage counts used (the number of database entries that possess all features from the left-hand side of the rule) are 100 (blue), 200 (pink), and 500 (green). The threshold for the minimal leverage count (the difference between the observed rule frequency and the frequency expected by chance given the frequencies of its left-hand side, LHS, and right-hand side, RHS) was set to 100 in all calculations.
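The three quantities named in this caption can be sketched for a single negative rule of the form "LHS implies absence of an item". The snippet below is an illustration under stated assumptions, not the authors' implementation: the function name, the `RuleStats` container, and the toy entries are hypothetical, and the leverage count is computed as the observed count minus the count expected if LHS and RHS were independent.

```python
from typing import Iterable, List, NamedTuple, Set


class RuleStats(NamedTuple):
    coverage: int          # entries that possess all LHS items
    strength: float        # P(RHS | LHS), here RHS = "item is absent"
    leverage: float        # observed count minus count expected by chance
    exceptions: List[int]  # indices of entries that violate the rule


def negative_rule_stats(entries: Iterable[Set[str]],
                        lhs: Set[str],
                        absent_item: str) -> RuleStats:
    """Evaluate the negative rule: lhs => NOT absent_item."""
    data = [frozenset(e) for e in entries]
    n = len(data)
    lhs = frozenset(lhs)

    covered = [i for i, e in enumerate(data) if lhs <= e]
    exceptions = [i for i in covered if absent_item in data[i]]
    coverage = len(covered)
    observed = coverage - len(exceptions)  # LHS present, item absent

    # Number of entries that satisfy the RHS on its own (item absent).
    rhs_count = sum(1 for e in data if absent_item not in e)

    strength = observed / coverage if coverage else 0.0
    expected = coverage * rhs_count / n if n else 0.0
    return RuleStats(coverage, strength, observed - expected, exceptions)


if __name__ == "__main__":
    # Toy entries, invented for illustration only.
    toy = [
        {"TM", "SignalP:+"},
        {"TM"},
        {"TM"},
        {"TM", "COILS:+"},
        {"COILS:+"},
        {"COILS:+"},
    ]
    print(negative_rule_stats(toy, {"TM"}, "COILS:+"))
```

In this toy set the rule "TM implies no COILS:+" covers four entries, and the one entry that carries both items is reported as an exception, which is the kind of entry the approach would flag for inspection.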
Figure 2. Fraction of annotation terms corrected based on the taxonomic information among all rule exceptions. The number of all exceptions found in each strength interval is shown above each bar.
Figure 3. Coverage of the negative and positive rule mining approaches. The numbers represent the percentage of all annotation features identified as potentially erroneous by each individual method and by both of them.
Manual verification of randomly selected feature samples
| Feature sample | Number of verified terms | Classified as "errors" | Percentage of actual errors in the sample |
| Flagged features | 203 | 150 | 74% |
| Non-flagged features | 798 | 430 | 54% |
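The percentages in this table follow directly from the counts. The short check below recomputes them; the function name and the ratio of the two error rates are added here for illustration and do not appear in the original table.

```python
def error_percentage(errors: int, verified: int) -> float:
    """Percentage of verified terms that were classified as errors."""
    return 100.0 * errors / verified


if __name__ == "__main__":
    flagged = error_percentage(150, 203)      # flagged features
    non_flagged = error_percentage(430, 798)  # non-flagged features
    print(f"Flagged: {flagged:.0f}%, non-flagged: {non_flagged:.0f}%")
    # Ratio of the two error rates (not reported in the table itself).
    print(f"Ratio: {flagged / non_flagged:.2f}")
```

Rounded to whole numbers, the two rates reproduce the 74% and 54% given in the table.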