Imane Boudellioua, Rabie Saidi, Robert Hoehndorf, Maria J Martin, Victor Solovyev.
Abstract
The widening gap between known proteins and their functions has encouraged the development of methods to automatically infer annotations. Automatic functional annotation of proteins is expected to meet the conflicting requirements of maximizing annotation coverage while minimizing erroneous functional assignments. This trade-off poses a great challenge in designing intelligent systems for automatic protein annotation. In this work, we present a system that uses rule mining techniques to predict metabolic pathways in prokaryotes. The resulting knowledge represents predictive models that assign pathway involvement to UniProtKB entries. We evaluated the performance of our system using cross-validation. It achieved very promising results in pathway identification, with an F1-measure of 0.982 and an AUC of 0.987. Our prediction models were then applied to 6.2 million UniProtKB/TrEMBL reference proteome entries of prokaryotes. As a result, 663,724 entries were covered, of which 436,510 lacked any previous pathway annotations.
Year: 2016 PMID: 27390860 PMCID: PMC4938425 DOI: 10.1371/journal.pone.0158896
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Current status in UniProtKB for prokaryotes.
| | Swiss-Prot | TrEMBL |
|---|---|---|
| Total number of entries | 351,649 | 34,356,770 |
| Percentage of entries with pathway annotations | 30.44% | 5.22% |
| Percentage of entries with InterPro annotations | 98.76% | 76.17% |
As of November 2015.
Evidence types considered for pathway annotation in UniProtKB/Swiss-Prot.
| Evidence ID | Evidence Label | Description |
|---|---|---|
| ECO:0000269 | Experimental evidence | Manually curated information for which there is published experimental evidence. |
| ECO:0000303 | Non-traceable author statement evidence | Manually curated information that is based on statements in scientific articles for which there is no experimental support. |
| ECO:0000305 | Curator inference evidence | Manually curated information which has been inferred by a curator based on his/her scientific knowledge or on the scientific content of an article. |
| ECO:0000250 | Sequence similarity evidence | Manually curated information which has been propagated from a related experimentally characterized protein. |
| ECO:0000255 | Sequence model evidence | Manually curated information which has been generated by the UniProtKB automatic annotation system or by various sequence analysis programs that are used during the manual curation process and which has been verified by a curator. |
| ECO:0000244 | Combinatorial evidence | Manually curated information inferred from a combination of experimental and computational evidence. |
Examples of itemsets corresponding to UniProtKB/Swiss-Prot entries of prokaryotes with manual assertion evidence for pathway annotations.
| Entry ID | Corresponding Itemset |
|---|---|
| Q8TRZ4 | PATHWAY:One-carbon metabolism; methanogenesis from acetate, TAXON:Archaea, TAXON:Euryarchaeota, TAXON:Methanomicrobia, TAXON:Methanosarcinales, TAXON:Methanosarcinaceae, TAXON:Methanosarcina, IPR:IPR017896, IPR:IPR017900, IPR:IPR004460, IPR:IPR004137, IPR:IPR009051, IPR:IPR011254, IPR:IPR016099 |
| P18335 | PATHWAY:Amino-acid biosynthesis; L-arginine biosynthesis; N(2)-acetyl-L-ornithine from L-glutamate: step 4/4, PATHWAY:Amino-acid biosynthesis; L-lysine biosynthesis via DAP pathway; LL-2,6-diaminopimelate from (S)-tetrahydrodipicolinate (succinylase route): step 2/3, TAXON:Bacteria, TAXON:Proteobacteria, TAXON:Gammaproteobacteria, TAXON:Enterobacteriales, TAXON:Enterobacteriaceae, TAXON:Escherichia, IPR:IPR017652, IPR:IPR004636, IPR:IPR005814, IPR:IPR015424, IPR:IPR015421, IPR:IPR015422 |
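
The itemsets above can be read as one flat set of prefixed items per entry. The following is a minimal sketch of that encoding, not the authors' code: the function name `build_itemset` and its three arguments are hypothetical, and only the `PATHWAY:`, `TAXON:`, and `IPR:` prefixes come from the table.

```python
def build_itemset(pathways, lineage, interpro_ids):
    """Flatten one entry's annotations into a set of prefixed items.

    Hypothetical helper: the prefixes mirror the table above, but the
    argument layout is an assumption made for illustration.
    """
    items = set()
    items.update(f"PATHWAY:{p}" for p in pathways)
    items.update(f"TAXON:{t}" for t in lineage)
    items.update(f"IPR:{i}" for i in interpro_ids)
    return items


# Entry Q8TRZ4, with values copied from the table above.
q8trz4 = build_itemset(
    pathways=["One-carbon metabolism; methanogenesis from acetate"],
    lineage=["Archaea", "Euryarchaeota", "Methanomicrobia",
             "Methanosarcinales", "Methanosarcinaceae", "Methanosarcina"],
    interpro_ids=["IPR017896", "IPR017900", "IPR004460", "IPR004137",
                  "IPR009051", "IPR011254", "IPR016099"],
)
```

Using a set rather than a list reflects that item order carries no meaning for rule mining; each entry then acts as one transaction.
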
Apriori threshold values considered for the system.
| Parameter | Value |
|---|---|
| Minimum number of items per association rule | 2 |
| Minimum support of an itemset (absolute number of transactions) | 20 |
| Minimum confidence of a rule as a percentage | 100% |
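
To make the three thresholds concrete, here is a small, unoptimized Apriori sketch in Python. It is an illustration under stated assumptions rather than the implementation used in the study: the function names are hypothetical, transactions are assumed to be sets of prefixed items, and restricting rules to a single `PATHWAY:` consequent with no pathway items in the antecedent is our reading of the rule examples, not a documented constraint.

```python
from itertools import combinations

MIN_ITEMS = 2         # minimum number of items per association rule
MIN_SUPPORT = 20      # absolute number of transactions
MIN_CONFIDENCE = 1.0  # 100%


def frequent_itemsets(transactions, min_support=MIN_SUPPORT):
    """Level-wise Apriori search; returns {itemset: absolute support}."""
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    singletons = {frozenset([i]) for t in transactions for i in t}
    level = {s: support(s) for s in singletons}
    level = {s: n for s, n in level.items() if n >= min_support}
    frequent = {}
    size = 1
    while level:
        frequent.update(level)
        size += 1
        # Join step: unions of two frequent itemsets with exactly `size` items.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: every (size - 1)-subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent
                   for sub in combinations(c, size - 1))
        }
        level = {c: support(c) for c in candidates}
        level = {c: n for c, n in level.items() if n >= min_support}
    return frequent


def pathway_rules(transactions, min_support=MIN_SUPPORT,
                  min_conf=MIN_CONFIDENCE):
    """Return (antecedent, consequent, absolute support) for each kept rule."""
    frequent = frequent_itemsets(transactions, min_support)
    rules = []
    for itemset, n in frequent.items():
        if len(itemset) < MIN_ITEMS:
            continue
        consequents = [i for i in itemset if i.startswith("PATHWAY:")]
        if len(consequents) != 1:  # assumption: exactly one pathway per rule
            continue
        antecedent = itemset - {consequents[0]}
        # Confidence = support(antecedent + consequent) / support(antecedent).
        if n / frequent[antecedent] >= min_conf:
            rules.append((antecedent, consequents[0], n))
    return rules


# Toy data: 20 entries carry IPR:A and the pathway, 5 share only the taxon.
toy = [{"IPR:A", "TAXON:X", "PATHWAY:P"}] * 20 + [{"IPR:B", "TAXON:X"}] * 5
rules = pathway_rules(toy)
```

On the toy data, `TAXON:X` alone does not yield a rule because only 20 of its 25 transactions carry the pathway, which falls below the 100% confidence threshold.
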
Examples of rules generated by Apriori along with their evaluation measures for UniProtKB/Swiss-Prot prokaryotic entries with manual assertion evidence for pathway annotations.
| Consequent | Antecedent | Support | Conf. | Lift | p-value |
|---|---|---|---|---|---|
| PATHWAY:Cofactor biosynthesis; adenosylcobalamin biosynthesis | IPR:IPR003705 | 3.24709e-04 | 1 | 90.5787 | 6.47155e-63 |
| PATHWAY:tRNA modification; archaeosine-tRNA biosynthesis | IPR:IPR004804 IPR:IPR002616 TAXON:Archaea | 3.35184e-04 | 1 | 2983.44 | 2.72224e-127 |
| PATHWAY:Amino-acid biosynthesis; L-leucine biosynthesis; L-leucine from 3-methyl-2-oxobutanoate: step 2/4 | IPR:IPR004430 IPR:IPR018136 IPR:IPR001030 TAXON:Enterobacteriaceae TAXON:Proteobacteria TAXON:Bacteria | 8.06536e-04 | 1 | 94.6184 | 1.07237e-155 |
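
For readers unfamiliar with these measures, the sketch below computes support, confidence, and lift for a single rule using their textbook definitions. It assumes support is the fraction of all transactions that contain both antecedent and consequent; some tools define it on the antecedent alone, and the source does not say which convention applies. The p-value is left out because the statistical test behind it is not stated here. The function name and toy data are hypothetical.

```python
def rule_measures(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_c = sum(1 for t in transactions if consequent in t)
    n_ac = sum(1 for t in transactions if antecedent <= t and consequent in t)
    if n_a == 0 or n_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = n_ac / n            # relative support of the whole rule
    confidence = n_ac / n_a       # P(consequent | antecedent)
    lift = confidence / (n_c / n) # confidence relative to the base rate
    return support, confidence, lift


# Toy data: 100 transactions, 10 of which carry the pathway.
toy_rules = (
    [{"IPR:A", "PATHWAY:P"}] * 4
    + [{"PATHWAY:P"}] * 6
    + [{"IPR:B"}] * 90
)
support, confidence, lift = rule_measures(toy_rules, {"IPR:A"}, "PATHWAY:P")
```

A lift well above 1, as in every row of the table, means the pathway is far more frequent among entries matching the antecedent than in the data set as a whole.
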
Examples of prediction models obtained in the form of aggregated rules along with their evaluation measures for UniProtKB/Swiss-Prot prokaryotic entries with manual assertion evidence for pathway annotations.
Each rule is accompanied by its four evaluation measures and its Euclidean distance to normalized ideal metrics.
Performance evaluation of our system as illustrated by a confusion matrix.
Results are averaged over two runs of five-fold cross-validation and reported with the corresponding deviation values (±d) from the observed values of the two runs.
| | Positive prediction | Negative prediction |
|---|---|---|
| Actually positive | TP = 109,136 ± 47 | FN = 3,824 ± 53 |
| Actually negative | FP = 22 ± 2 | TN = 33,180,542 ± 1,015 |
Evaluation metrics of cross-validation experiment over UniProtKB/Swiss-Prot prokaryotic entries with pathway annotations of manual assertion evidence.
| Metric | Value |
|---|---|
| Precision | 0.999 |
| Recall | 0.966 |
| F1-measure | 0.982 |
| AUC | 0.987 |
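
As a cross-check, precision, recall, and F1 can be recomputed from the averaged confusion-matrix counts reported above. This is a minimal sketch with a hypothetical function name; the results agree with the reported metrics to within about 0.001, and the small differences are expected because the reported values are averaged over runs and rounded. AUC cannot be recovered from a single confusion matrix, so it is not computed.

```python
def precision_recall_f1(tp, fn, fp):
    """Standard definitions of precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Mean counts from the confusion matrix (deviations ignored).
TP, FN, FP = 109_136, 3_824, 22
precision, recall, f1 = precision_recall_f1(TP, FN, FP)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The very small number of false positives relative to true positives is what drives precision close to 1, while recall is limited by the 3,824 missed annotations.
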
Current status in UniProtKB/TrEMBL for prokaryotic reference proteome set.
| | TrEMBL |
|---|---|
| Total number of entries | 6,193,540 |
| Percentage of entries with pathway annotations | 3.67% |
| Percentage of entries with InterPro annotations | 80.68% |
As of November 2015.
Fig 1. Annotation coverage for UniProtKB/TrEMBL reference proteome prokaryotic entries.
(a) represents entries we could cover which lack pathway annotation, (b) represents entries we could cover which already have pathway annotation, and (c) represents entries we could not cover which already have pathway annotation.
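
The sizes of groups (a) and (b) follow from the figures given in the abstract and in the reference proteome table. The short calculation below is only a worked example of that arithmetic, with hypothetical variable names. Group (c) is not derived, because the rounded 3.67% share of annotated entries is too coarse to recover it reliably.

```python
total_entries = 6_193_540   # reference proteome entries (table above)
covered = 663_724           # entries covered by the prediction models (abstract)
covered_unannotated = 436_510  # group (a): covered, no prior pathway annotation

# Group (b): covered entries that already had a pathway annotation.
covered_annotated = covered - covered_unannotated

# Share of the reference proteome set touched by the models.
coverage_share = covered / total_entries

print(covered_annotated)          # 227214
print(f"{coverage_share:.2%}")
```
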
Overview of HAMAP-Rule, Rule-Base, and SAAS.
| System | Methodology | Evaluation methodology |
|---|---|---|
| HAMAP-Rule | Semi-automated/manual: rules are created by bio-curators and applied automatically. | Bio-curator expertise. |
| Rule-Base | Semi-automated/manual: rules are created by bio-curators, statistically validated, and applied automatically. | Bio-curator expertise; in addition, each rule must have a confidence of more than 95%. |
| SAAS | Automated: rules are created by a C4.5 decision tree algorithm and applied automatically. | Each rule must have a confidence of more than 95%. |
Fig 2. Comparison of annotation coverage of UniProtKB/TrEMBL reference proteome prokaryotic entries with the three main automatic annotation systems present in UniProtKB/TrEMBL: SAAS, HAMAP-Rule, and Rule-Base.
Fig 3. Comparison of predictions applied to UniProtKB/TrEMBL reference proteome prokaryotic entries relative to the three main automatic annotation systems present in UniProtKB/TrEMBL: HAMAP-Rule, SAAS, and Rule-Base.
Fig 4. Comparison of predictions corresponding to UniProtKB/TrEMBL reference proteome prokaryotic entries touched by our system, HAMAP-Rule, SAAS, and Rule-Base.