Kevin Nagel, Antonio Jimeno-Yepes, Dietrich Rebholz-Schuhmann.
Abstract
BACKGROUND: A protein annotation database, such as the Universal Protein Resource knowledge base (UniProtKb), is a valuable resource for the validation and interpretation of predicted 3D structure patterns in proteins. Existing studies have focussed on methods for extracting point mutations from the biomedical literature, which can support the time-consuming work of manual database curation. However, these methods are limited to point mutation extraction and do not extract features for the annotation of proteins at the residue level.
Year: 2009 PMID: 19758468 PMCID: PMC2745586 DOI: 10.1186/1471-2105-10-S8-S4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Overview of text mining processes and evaluation methods for the extraction of functional annotation. The presented functional annotation extraction system consists of two major text mining processes: protein residue identification (left-hand side), and contextual feature extraction (right-hand side). The extracted annotations are compared with information in the feature table from UniProtKb.
Regular expression patterns for the detection of residue mentions in text. The patterns recognise single (SITE) or multiple wild-type residue sites (SITES), a sequence range or residue pair (RANGE/PAIR), and point mutations (MUTATION). The set covers abbreviated notations of residues as well as grammatical expressions found in text.
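To make the pattern types named in the caption above more concrete, here is a minimal Python sketch of two of them, SITE and MUTATION. The expressions are hypothetical simplifications written for illustration only; they are not the patterns used in the study, and the names `AA3`, `AA1`, `find_sites` and `find_mutations` are assumptions.

```python
import re

# Hypothetical, simplified patterns for illustration; not the study's own expressions.
AA3 = r"(?:Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val)"
AA1 = r"[ACDEFGHIKLMNPQRSTVWY]"

# SITE: three-letter residue code plus sequence position, e.g. "Cys325", "His-48".
SITE = re.compile(rf"\b(?P<res>{AA3})[- ]?(?P<pos>\d+)\b")

# MUTATION: wild-type residue, position, mutant residue, e.g. "A123T", "Ala123Thr".
MUTATION = re.compile(rf"\b(?P<wt>{AA3}|{AA1})(?P<pos>\d+)(?P<mt>{AA3}|{AA1})\b")


def find_sites(text):
    """Return (residue, position) pairs for wild-type site mentions."""
    return [(m.group("res"), int(m.group("pos"))) for m in SITE.finditer(text)]


def find_mutations(text):
    """Return (wild type, position, mutant) triples for point mutation mentions."""
    return [
        (m.group("wt"), int(m.group("pos")), m.group("mt"))
        for m in MUTATION.finditer(text)
    ]
```

The trailing word boundary in `SITE` keeps a mutation such as "Ala123Thr" from also being reported as a wild-type site. Full residue names ("cysteine 298") and the SITES and RANGE/PAIR types would need further patterns that this sketch leaves out.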
Rule set for shallow parsing. The rules are used to identify general verbal and prepositional relations between noun phrases in text. N is a noun, Det a determiner, Adj an adjective, Adv an adverb, P a preposition, V a verb, Aux an auxiliary verb, InfTo the infinitive marker "to", NP a noun phrase, PP a prepositional phrase, VP a verb phrase, VG a verb group, and REL is the target relation. Note that the grammar does not consider coordinating conjunctions, e.g. "and", "or" and ",".
| NP = Det? (Adj|Adv|N)* N |
| PP = P NP |
| VG = (Adv|Aux|V|InfTo)* V |
| VP = VG NP PP* |
| REL = NP PP* VP |
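One way to read the rule set above is as a regular expression over part-of-speech tags. The sketch below assumes that a tagger has already produced one tag per token using the tag names from the caption; the function name `is_relation` and the encoding of tags as space-terminated strings are illustrative choices, not part of the original system.

```python
import re

# Each rule is written over a string of space-terminated tags, e.g. "Det N V N ".
NP = r"(?:(?:Det )?(?:(?:Adj|Adv|N) )*N )"      # NP  = Det? (Adj|Adv|N)* N
PP = rf"(?:P {NP})"                              # PP  = P NP
VG = r"(?:(?:(?:Adv|Aux|V|InfTo) )*V )"          # VG  = (Adv|Aux|V|InfTo)* V
VP = rf"(?:{VG}{NP}{PP}*)"                       # VP  = VG NP PP*
REL = re.compile(rf"{NP}{PP}*{VP}")              # REL = NP PP* VP


def is_relation(tags):
    """Return True if the whole tag sequence is derivable as REL."""
    encoded = "".join(tag + " " for tag in tags)
    return REL.fullmatch(encoded) is not None
```

For instance, a tag sequence such as `Det N P N V Det N` (roughly "the mutation of His48 affects the binding") is accepted, whereas any sequence containing a conjunction tag is rejected, which mirrors the limitation stated in the caption.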
Biological categories for the interpretation of functional annotations. The interpretation of extracted annotations is based on the automatic assignment of semantic labels to the arguments of a PAS. Because a comprehensive ontology is not available, two categorisation schemes are tested in this study. The first scheme (MAN) was designed from an analysis of relevant MEDLINE sentences for residue annotation (bottom-up approach). Alternatively, the categories in the feature table of UniProtKb (FEAT) can be reused (top-down approach). Both categorisation schemes reflect concepts of biological interest. However, the bottom-up approach has the advantage that the proposed categories are data-driven, whereas in a top-down approach examples of the listed categories may not be present in natural language text, or other categories may be missing from the scheme.
| MAN | | FEAT | |
| Category | Definition | Category | Definition |
| STR_COMP | Structure component. Class denoting concepts that represent pieces and parts of the protein structure. | DOMAIN | Extent of a domain, which is defined as a specific combination of secondary structures organised into a characteristic three-dimensional structure or fold. |
| | | MOTIF | Short (up to 20 amino acids) sequence motif of biological interest. |
| | | TOPO_DOM | Topological domain. |
| | | CHAIN | Extent of a polypeptide chain in the mature protein. |
| | | TRANSMEM | Extent of a transmembrane region. |
| | | COILED | Extent of a coiled-coil region. |
| CHEM_MOD | Chemical modification. Class denoting changes to the protein sequence and the chemical composition. | VARIANT | Authors report that sequence variants exist. |
| | | MOD_RES | Posttranslational modification of a residue. |
| | | PEPTIDE | Extent of a released active peptide. |
| | | VAR_SEQ | Description of sequence variants produced by alternative splicing, alternative promoter usage, alternative initiation and ribosomal frameshifting. |
| | | LIPID | Covalent binding of a lipid moiety. |
| | | CARBOHYD | Glycosylation site. |
| STR_MOD | Structural modification. Class denoting the changes to the protein structure without changes to the chemical composition. | REGION | Extent of a region of interest in the sequence. |
| | | SITE | Any interesting single amino-acid site on the sequence that is not defined by another feature key. |
| BINDING | Binding type. Class denoting different physico-chemical forces leading to a bond formation between a protein structure component and a chemical entity. | BINDING | Binding site for any chemical group (co-enzyme, prosthetic group, etc.). |
| | | METAL | Binding site for a metal ion. |
| | | DISULFID | Disulfide bond. |
| | | CROSSLNK | Posttranslationally formed amino acid bonds. |
| | | DNA_BIND | Extent of a DNA-binding region. |
| | | NP_BIND | Extent of a nucleotide phosphate-binding region. |
| | | ZN_FING | Extent of a zinc finger region. |
| | | CA_BIND | Extent of a calcium-binding region. |
| ENZ_ACT | Enzymatic activity. Types of enzymatic reactions as a subpart to protein functions. | ACT_SITE | Amino acid(s) involved in the activity of an enzyme. |
| CELL | Cellular phenotype. Class denoting different cellular phenotypes that can be affected by structural or compositional changes of a protein. | N/A | |
Test corpora for information extraction evaluation. Based on the citation references from UniProtKb a base corpus was generated by retrieving abstract texts from MEDLINE. Two test corpora were derived from this corpus: the gold standard corpus (GC), which resembles a manually annotated test set, and the cross-validation corpus (XC), which contains automatically assigned annotations based on information from UniProtKb.
| Dataset | Gold standard corpus (GC) | Cross-validation corpus (XC1) | Cross-validation corpus (XC2) |
| Abstracts count | 100 | 55,998 | 5,253 |
| Method of annotation | manual | automatic | automatic |
| total/unique residues | 362/262 (with 262/191 having residue name + residue sequence position) | N/A | N/A |
| total/unique proteins | 990/511 | N/A | N/A |
| total/unique organisms | 323/123 | N/A | N/A |
| total/unique associations | 240/172 residue-protein-organism associations | NA/70,401 protein-organism as UTP | NA/68,008 protein-residue as URP |
| Application | Test the type, amount and reliability of the extracted information (reproduction of manually annotated information). | Test set is assumed to contain the same type of information as GC, but certainty is not clear. Study the reproduction of information contained in the database. | Test set is assumed to contain the same type of information as GC, but certainty is not clear. Study the reproduction of information contained in the database. |
Figure 2. Cross-validation of citations from identified protein residues with UniProtKb/PDB. For a subset of identified UniProtKb/PDB proteins (i.e. proteins with UniprotID and PDBID) in MEDLINE, the determined PubMed identifiers (PMIDs) can be cross-validated with the relevant citation set from UniProtKb. uni = UniProtKb/PDB based citations; med = protein residue identification based citations; comm = common set of citations between uni and med.
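The three quantities in this caption amount to a set intersection. As a minimal sketch, assuming the two citation sets are available as collections of PMIDs, the comparison could look as follows; the function name, the `share_of_uni` ratio and the toy identifiers are illustrative assumptions and do not come from the study.

```python
def cross_validate_citations(uni, med):
    """Compare database-derived citations (uni) with text-mined citations (med)."""
    uni, med = set(uni), set(med)
    comm = uni & med  # citations found by both routes
    return {
        "uni": len(uni),
        "med": len(med),
        "comm": len(comm),
        # Illustrative ratio: share of database citations also found by text mining.
        "share_of_uni": len(comm) / len(uni) if uni else 0.0,
    }


# Toy identifiers, not real PMIDs from the study.
result = cross_validate_citations(uni=[101, 102, 103, 104], med=[103, 104, 105])
```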
Performance evaluation of the classifiers (precision, recall, F1 measure). Classification with categories from MAN and FEAT was analysed by 100 repetitions of 5-fold cross-validation.
| MAN | | | | FEAT | | | |
| Category | Precision | Recall | F1 | Category | Precision | Recall | F1 |
| STR_COMP | 0.56 | 0.69 | 0.62 | DOMAIN | 0.50 | 0.24 | 0.32 |
| | | | | MOTIF | 0.98 | 0.36 | 0.53 |
| | | | | TOPO_DOM | 0 | 0 | 0 |
| | | | | CHAIN | 0 | 0 | 0 |
| | | | | TRANSMEM | 0 | 0 | 0 |
| | | | | COIL | 0 | 0 | 0 |
| CHEM_MOD | 0.54 | 0.59 | 0.57 | VARIANT | 0.50 | 0.69 | 0.58 |
| | | | | MOD_RES | 0.40 | 0.23 | 0.29 |
| | | | | PEPTIDE | 0.05 | 0.06 | 0.05 |
| | | | | VAR_SEQ | 0 | 0 | 0 |
| | | | | LIPID | 1 | 0.32 | 0.48 |
| | | | | CARBOHYD | 0 | 0 | 0 |
| STR_MOD | 0.24 | 0.10 | 0.15 | REGION | 0.44 | 0.44 | 0.44 |
| | | | | SITE | 0.40 | 0.55 | 0.46 |
| BINDING | 0.63 | 0.52 | 0.57 | BINDING | 0.41 | 0.45 | 0.43 |
| | | | | METAL | 0.05 | 0.02 | 0.03 |
| | | | | DISULFID | 0.53 | 0.15 | 0.23 |
| | | | | CROSSLNK | 0 | 0 | 0 |
| | | | | DNA_BIND | 0 | 0 | 0 |
| | | | | NP_BIND | 0 | 0.06 | 0 |
| | | | | ZN_FING | 0 | 0 | 0 |
| | | | | CA_BIND | 0 | 0 | 0 |
| ENZ_ACT | 0.43 | 0.20 | 0.27 | ACT_SITE | 0.45 | 0.31 | 0.36 |
| CELL | 0.50 | 0.31 | 0.38 | N/A | | | |
| GEN_BIOL | 0.70 | 0.64 | 0.67 | GEN_BIOL | 0.76 | 0.65 | 0.70 |
| GEN_ENG | 0.21 | 0.32 | 0.26 | GEN_ENG | 0.23 | 0.32 | 0.27 |
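The evaluation protocol named in the caption, 5-fold cross-validation repeated 100 times, can be sketched as a generator of train/test index splits. This is a minimal illustration that assumes each repetition reshuffles the data independently before splitting into folds; the function name, the seed handling and the fold assignment are assumptions, and the classifier itself is left out.

```python
import random


def repeated_kfold(n_items, k=5, repeats=100, seed=0):
    """Yield (train, test) index lists for `repeats` rounds of k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_items))
        rng.shuffle(indices)  # new random partition for every repetition
        folds = [indices[i::k] for i in range(k)]
        for i, test in enumerate(folds):
            # Training data is everything outside the held-out fold.
            train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
            yield train, test
```

Precision, recall and F1 would then be computed on each held-out fold and aggregated over all 500 splits.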
Overall F1 measure of the entire classification. The global F1 measure of the classification problem is computed by two different types of averages: micro-average and macro-average. In micro-averaging, the F1 measure is calculated globally over all categories, while in macro-averaging, F1 is computed locally over each category first, and then the average over all categories is taken.
| | MAN | FEAT |
| F1(micro-averaged) | 0.56 | 0.55 |
| F1(macro-averaged) | 0.43 | 0.19 |
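The difference between the two averages can be illustrated with a short Python sketch. It assumes per-category counts of true positives, false positives and false negatives are available; the function names and the toy counts are illustrative and are not taken from the study.

```python
def f1(tp, fp, fn):
    """F1 measure from raw counts; returns 0.0 when it is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_macro_f1(counts):
    """counts maps category -> (tp, fp, fn); returns (micro F1, macro F1)."""
    # Macro: compute F1 per category first, then average the scores.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    # Micro: pool the counts over all categories, then compute one F1.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn), macro


# Toy counts: one frequent, well-classified category and one rare, poorly classified one.
micro, macro = micro_macro_f1({"frequent": (90, 10, 10), "rare": (1, 9, 9)})
```

In the toy example the rare category pulls the macro average down far more than the micro average. A similar effect would be consistent with the FEAT scheme, where many categories score an F1 of 0.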
Performance analysis of the classifiers (confusion matrix). Classification with categories from MAN was analysed by 100 repetitions of 5-fold cross-validation. The result is represented as a confusion matrix.
| Actual | Prediction | | | | | | | |
| | BINDING | GEN_BIOL | CELL | CHEM_MOD | GEN_ENG | ENZ_ACT | STR_COMP | STR_MOD |
| BINDING | | 762 | 28 | 93 | 165 | 26 | 546 | 0 |
| GEN_BIOL | 560 | | 525 | 1,496 | 4,514 | 159 | 1,714 | 65 |
| CELL | 96 | 1,167 | | 150 | 325 | 91 | 67 | 0 |
| CHEM_MOD | 38 | 1,103 | 12 | | 761 | 79 | 546 | 25 |
| GEN_ENG | 144 | 2,556 | 126 | 510 | | 46 | 480 | 35 |
| ENZ_ACT | 33 | 338 | 80 | 201 | 226 | | 457 | 0 |
| STR_COMP | 160 | 783 | 64 | 551 | 592 | 35 | | 11 |
| STR_MOD | 1 | 91 | 1 | 129 | 125 | 0 | 21 | |
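Per-category precision and recall follow directly from a confusion matrix of this kind. The sketch below assumes rows hold the actual category and columns the predicted category, as in the table header; the function name and the 2x2 toy matrix are illustrative and do not use the study's counts.

```python
def per_class_scores(matrix, labels):
    """Return {label: (precision, recall)} for a square confusion matrix.

    matrix[i][j] counts items of actual class i predicted as class j.
    """
    scores = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        predicted_as_label = sum(row[i] for row in matrix)  # column total
        actually_label = sum(matrix[i])                      # row total
        precision = true_pos / predicted_as_label if predicted_as_label else 0.0
        recall = true_pos / actually_label if actually_label else 0.0
        scores[label] = (precision, recall)
    return scores


# Toy matrix: 8 of 10 actual X and 6 of 10 actual Y are classified correctly.
toy_scores = per_class_scores([[8, 2], [4, 6]], ["X", "Y"])
```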
Figure 3. Performance evaluation of the functional annotation extraction system. Annotation extraction is dependent on the performances of two text mining modules: protein residue identification and contextual feature extraction. The analysis compares the extraction depending on both modules, and the extraction depending solely on contextual feature extraction, i.e. assuming all protein residues are correctly identified. The performance was measured in terms of precision, recall, and F1 measure.
GO terms are not suitable for protein residue annotation. The presented examples demonstrate that predicted GO terms are not always suitable for protein residue annotation. The prediction of GO terms was done with an information theory based parser [34].
| | | Annotation | |
| Example | Sentence | Manual | GO |
| 1 | "The catalytic mechanism of the non-phosphorylating glyceraldehyde-3-phosphate dehydrogenase and the other aldehyde dehydrogenases resembles a thioester mechanism involving the universally conserved cysteine 298 (pea GAPN)." (PMID:9461340) | thioester mechanism, conserved cysteine | glyceraldehyde-3-phosphate dehydrogenase (NADP+)(phosphorylating activity), glyceraldehyde-3-phosphate biosynthesis, glyceraldehyde-3-phosphate catabolism, phosphoglycerate dehydrogenase activity |
| 2 | "However, mutations of a key residue, His48, show significant deviation from the relationship, implying a role for the side chain in protection of the complex from hydroxide attack." (PMID:2690955) | protection of the complex from hydroxide attack | AT DNA binding, tRNA, tyrosine tRNA ligase activity |
| 3 | "Second, this reactive cysteinyl residue, which is required for L-cysteine desulfurization activity, was identified as Cys325 by the specific alkylation of that residue and by site-directed mutagenesis experiments." (PMID:81615929) | L-cysteine desulfurization activity | pyridoxal biosynthesis, phosphate binding, mutagenesis, nitrogenase activity, L-alanine biosynthesis, pyridoxal phosphate binding |