Michael J Bell, Matthew Collison, Phillip Lord.
Abstract
A constant influx of new data poses a challenge in keeping the annotation in biological databases current. Most biological databases contain significant quantities of textual annotation, which is often the richest source of knowledge. Many databases reuse existing knowledge; during the curation process annotations are often propagated between entries. However, this is often not made explicit. Therefore, it can be hard, potentially impossible, for a reader to identify where an annotation originated. Within this work we attempt to identify annotation provenance and track its subsequent propagation. Specifically, we exploit annotation reuse within the UniProt Knowledgebase (UniProtKB), at the level of individual sentences. We describe a visualisation approach for the provenance and propagation of sentences in UniProtKB which enables a large-scale statistical analysis. Initially, levels of sentence reuse within UniProtKB were analysed, showing that reuse is heavily prevalent, which enables the tracking of provenance and propagation. By analysing sentences throughout UniProtKB, a number of interesting propagation patterns were identified, covering over [Formula: see text] sentences. Over [Formula: see text] sentences remain in the database after they have been removed from the entries where they originally occurred. Analysing a subset of these sentences suggests that approximately [Formula: see text] are erroneous, whilst [Formula: see text] appear to be inconsistent. These results suggest that being able to visualise sentence propagation and provenance can aid in the determination of the accuracy and quality of textual annotation. Source code and supplementary data are available from the authors' website at http://homepages.cs.ncl.ac.uk/m.j.bell1/sentence_analysis/.
Year: 2013 PMID: 24143170 PMCID: PMC3797126 DOI: 10.1371/journal.pone.0075541
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Outline view of the data extraction process.
(1) Initially we download a complete dataset for a given database version in flat file format. (2) We then extract the comment lines (lines beginning with ‘CC’, the comment indicator). (3) We remove comment blocks and properties (as defined in the UniProtKB manual [17]) and the ‘CC’ identifier. (4) Sentences are then extracted, using LingPipe. (5) Finally, all of the identified sentences are added to the MySQL database.
Figure 2. Total sentences and entries.
The total number of sentences and entries in (A) Swiss-Prot and (B) TrEMBL.
Figure 3. Average sentences per entry.
The number of sentences that appear, on average, in each entry in TrEMBL and Swiss-Prot (i.e. the total number of sentences divided by the total number of entries).
Figure 4. Distribution of sentence reuse through Swiss-Prot and TrEMBL.
(A) Swiss-Prot Version 9 (B) UniProtKB/Swiss-Prot Version 2012_05 (C) TrEMBL Version 1 (D) UniProtKB/TrEMBL Version 2012_05. As an example, in panel A, the bottom-right point states that there are ∼5000 sentences that occur a single time, whilst the top-left-most point indicates that there is one sentence that occurs ∼125 times.
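Each point in such a plot pairs an occurrence count with the number of distinct sentences that have that count. A minimal sketch of how that distribution could be computed follows; the function name is hypothetical and the input is assumed to be a flat list of every sentence occurrence in one database version.

```python
from collections import Counter

def reuse_distribution(sentences):
    """Map 'number of occurrences' to 'number of distinct sentences with that count'."""
    occurrences = Counter(sentences)       # sentence -> times it occurs
    return Counter(occurrences.values())   # occurrence count -> number of sentences
```

For example, a result containing the pair `1: 5000` would correspond to a point stating that 5000 sentences occur a single time.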
Figure 5. Average entries per sentence.
Graph showing the average number of entries that each sentence appears in for (A) Swiss-Prot and (B) TrEMBL.
Figure 6. Entries without annotation.
(A) Number of entries in TrEMBL and Swiss-Prot without any annotation, and (B) the percentage of entries without any annotation.
Figure 7. Unique sentences.
(A) The number of unique sentences in Swiss-Prot and TrEMBL and (B) the percentage of unique sentences in Swiss-Prot and TrEMBL.
Figure 8. Singleton sentences.
(A) The number of singleton sentences in Swiss-Prot and TrEMBL and (B) the percentage of singleton sentences in Swiss-Prot and TrEMBL.
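The per-release quantities plotted in Figures 3 and 5 to 8 can be derived from a simple mapping of entries to sentences. The sketch below is illustrative only. It assumes that a "unique" sentence means a distinct sentence and that a "singleton" is a sentence found in exactly one entry; the captions do not define either term, and the function and key names are hypothetical.

```python
from collections import Counter

def release_statistics(entries):
    """entries: dict mapping accession -> list of sentences in one database release.

    Assumes at least one entry and at least one sentence overall.
    """
    total_entries = len(entries)
    total_sentences = sum(len(sents) for sents in entries.values())
    # For each distinct sentence, count how many entries contain it.
    entries_per_sentence = Counter(
        s for sents in entries.values() for s in set(sents)
    )
    unique = len(entries_per_sentence)
    singletons = sum(1 for n in entries_per_sentence.values() if n == 1)
    unannotated = sum(1 for sents in entries.values() if not sents)
    return {
        "avg_sentences_per_entry": total_sentences / total_entries,
        "avg_entries_per_sentence": sum(entries_per_sentence.values()) / unique,
        "entries_without_annotation": unannotated,
        "pct_without_annotation": 100 * unannotated / total_entries,
        "unique_sentences": unique,
        "singleton_sentences": singletons,
    }
```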
Figure 9. Manual illustration showing how the propagation of a single sentence could be visualised.
Accession numbers are shown on the x-axis, with database release dates shown on the y-axis. A point on the graph indicates that the sentence occurs in an entry within a given database version. For example, the bottom-left point shows that the sentence occurs in entry Q9NQX7 in Swiss-Prot in 2000; the sentence remains in Q9NQX7 for one more version and is removed in the following version (in 2002).
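The points in such a graph are (release, accession) pairs, and the earliest release gives the sentence's apparent provenance. A minimal sketch, assuming the data are held as a dictionary keyed by a sortable release label; the function names and the data layout are hypothetical, not taken from the authors' code.

```python
def propagation_points(releases, sentence):
    """releases: dict mapping a sortable release label -> {accession: set of sentences}.

    Returns the (release, accession) pairs at which the sentence occurs,
    ordered by release and then by accession.
    """
    return [
        (release, accession)
        for release in sorted(releases)
        for accession in sorted(releases[release])
        if sentence in releases[release][accession]
    ]

def origin(points):
    """Earliest release containing the sentence, with the accessions that carried it."""
    if not points:
        return None
    first_release = points[0][0]
    return first_release, [acc for release, acc in points if release == first_release]
```

Plotting accession against release for each returned pair reproduces the layout described in the caption.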
Figure 10. Visualising sentence propagation.
Visualising the propagation of the sentence “the active-site selenocysteine is encoded by the opal codon, uga.” through the database, with all possible versions of Swiss-Prot and TrEMBL within this range shown at either end of the graph.
Figure 11. Illustrating the key interactive features provided by Highcharts graphs.
(A) Hovering over a point indicates the corresponding accession number and database version. (B) Clicking on a point provides links to the UniProt entry and further information. (C) Each graph can be printed and exported into a variety of image formats. (D) The ability to zoom into a section of a graph; this can be achieved by left-clicking and dragging a desired area.
Figure 12. Visualising sentence occurrences.
The number of UniProtKB entries that the sentence “the active-site selenocysteine is encoded by the opal codon, uga.” appears in over time.
Figure 13. Visualising a sentence originating in TrEMBL.
An example of a sentence (“inactivated by cyanide.”) that originates in TrEMBL, but ends up in Swiss-Prot. In this case, a number of the TrEMBL entries are merged into Swiss-Prot.
Number of identified propagation patterns.
| Pattern | Number of sentences | Number in just UniProtKB Version 2012_05 |
| Missing Origin | 8355 | 3835 |
| Reappearing Entry | | 7011 |
| Transient Appearance | | |
| Originating in TrEMBL | 8649 | 5330 |
The number of sentences that adhere to each pattern, for all versions of UniProtKB and those just in the latest version of UniProtKB. To place these results in context, there have been a total of unique sentences, with unique sentences being in UniProtKB Version 2012_05.
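As an illustration of how one of these patterns might be detected, the sketch below tests for a Missing Origin. It follows the abstract's description (a sentence that remains in the database after being removed from the entries where it originally occurred). It simplifies by comparing the sentence's first and last recorded appearances rather than a fixed database version. The function name and the input layout are hypothetical.

```python
def has_missing_origin(points):
    """points: list of (release, accession) pairs for one sentence, releases sortable."""
    if not points:
        return False
    first = min(release for release, _ in points)
    last = max(release for release, _ in points)
    origin_entries = {acc for release, acc in points if release == first}
    latest_entries = {acc for release, acc in points if release == last}
    # The sentence survives, but none of its original entries still carry it.
    return first != last and origin_entries.isdisjoint(latest_entries)
```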
Figure 14. Visualising sentence propagation.
Visualisation for the sentence “the active-site selenocysteine is encoded by the opal codon, uga (by similarity).”
Figure 15. Decision tree summarising the protocol used to determine the classification of a sentence.
There are four main questions within the protocol that lead to a sentence being classified into one of five possible classifications.
Figure 16. Visualising sentence propagation.
Visualising the propagation of the sentence “may have an essential function in lipopolysaccharides biosynthesis.” through the database.
All analysed sentences and their classification.
| Sentence | Classification |
| belongs to the 40s cdc5-associated complex (or cwf complex), a spliceosome sub-complex reminiscent of a late-stage spliceosome composed of the u2, u5 and u6 snrnas and at least brr2, cdc5, cwf2, cwf3, cwf4, cwf5, cwf6, cwf7, cwf8, cwf9, cwf10, cwf11, cwf12, cwf13, cwf14, cwf15, cwf16, cwf17, cwf18, cwf19, cwf20, cwf21, cwf22, cwf23, cwf24, cwf25, cwf26, cwf27, cwf28, ist3, lea1, msl1, prp5, prp10, prp12, prp17, prp22, sap61, sap62, sap114, sap145, slu7, smb1, smd1, smd3, smf1, smg1 and syf2. | Inconsistent |
| the light chain is composed of three structural domains: a large globular n-terminal domain which may be involved in binding to kinesin heavy chains, a central alpha-helical coiled-coil domain that mediates the light chain dimerization; and a small globular c-terminal which may play a role in regulating mechanochemical activity or attachment of kinesin to membrane-bound organelles (by similarity). | Erroneous |
| the biological conversion of cellulose to glucose generally requires three types of hydrolytic enzymes: 1) endoglucanases which cut internal beta-1,4-glucosidic bonds; 2) exocellobiohydrolases that cut the dissaccharide cellobiose from the nonreducing end of the cellulose polymer chain; 3) beta-1,4-glucosidases which hydrolyze the cellobiose and other short cello-oligosaccharides to glucose. | Inconsistent |
| in the hair cortex, hair keratin intermediate filaments are embedded in an interfilamentous matrix, consisting of hair keratin-associated protein (krtap), which are essential for the formation of a rigid and resistant hair shaft through their extensive disulfide bond cross-linking with abundant cysteine residues of hair keratins. | Inconsistent |
| the beta subunit of voltage-dependent calcium channels contributes to the function of the calcium channel by increasing peak calcium current, shifting the voltage dependencies of activation and inactivation, modulating g protein inhibition and controlling the alpha-1 subunit membrane targeting (by similarity). | Erroneous |
| interacts with the c-terminal of peptidylglycine alpha-amidating monooxygenase (pam) and may act as part of a signal transduction system linking the catalytic domains of pam in the lumen of the secretory pathway to cytosolic factors regulating the cytoskeleton and signal transduction pathways. | Erroneous |
| the modification is dependent on dna and is involved in the regulation of various important cellular processes such as differentiation, proliferation, and tumor transformation and also in the regulation of the molecular events involved in the recovery of cell from dna damage (by similarity). | Erroneous |
| adenosylhomocysteine is a competitive inhibitor of s-adenosyl-l-methinine-dependent methyl transferase reactions; therefore adenosylhomocysteinase may play a key role in the control of methylations via regulation of the intracellular concentration of adenosylhomocysteine (by similarity). | Inconsistent |
| component of the multisynthetase complex which is comprised of a bifunctional glutamyl-prolyl-trna synthetase, the monospecific isoleucyl, leucyl, glutaminyl, methionyl, lysyl, arginyl, and aspartyl-trna synthetases as well as three auxiliary proteins, p18, p48 and p43 (by similarity). | Erroneous |
| self; 2; ebi-311928, ebi-311928; p03949:abl-1; 4; ebi-311928, ebi-2315883; q17539:c01b10.8; 5; ebi-311928, ebi-311920; q95qi7:daf-3; 2; ebi-311928, ebi-326363; q09248:dnc-2; 2; ebi-311928, ebi-316282; q09975:lys-8; 2; ebi-311928, ebi-313861; q21831:snfc-5; 2; ebi-311928, ebi-360213; | Erroneous |
| the n-terminal of the protein extends into the stroma where it is involved with adhesion of granal membranes and photoregulated by reversible phosphorylation of its threonine residues; both are believed to mediate the distribution of excitation energy between photosystems i and ii. | Inconsistent |
| the modification is dependent on dna and is involved in the regulation of various important cellular processes such as differentiation, proliferation, and tumor transformation and also in the regulation of the molecular events involved in the recovery of cell from dna damage. | Erroneous |
| the iicd domains contain the sugar binding site and the transmembrane channel; the iia domain contains the primary phosphorylation site (the donor is phospho-hpr); iia transfers its phosphoryl group to the iib domain which finally transfers it to the sugar (by similarity). | Too Many |
| adenosylhomocysteine is a competitive inhibitor of s-adenosyl-l-methinine-dependent methyl transferase reactions; therefore adenosylhomocysteinase may play a key role in the control of methylations via regulation of the intracellular concentration of adenosylhomocysteine. | Inconsistent |
| this delta-9 desaturase is a terminal component of the liver microsomal stearyl-coa desaturase system, that utilizes o(2) and electrons from reduced cytochrome b(5) to catalyze the insertion of a double bond into a spectrum of fatty acyl-coa substrates (by similarity). | Inconsistent |
| in the absence of mercury merr represses transcription by binding tightly to the mer operator region; when mercury is present the dimeric complex binds a single ion and becomes a potent transcriptional activator, while remaining bound to the mer site (by similarity). | Erroneous |
| chemotactic-signal tranducers respond to changes in the concentration of attractants and repellents in the environment, transduce a signal from the outside to the inside of the cell, and facilitate sensory adaptation through the variation of the level of methylation. | Inconsistent |
| activated by tyrosine-phosphorylation in response to either integrin clustering induced by cell adhesion or antibody cross-linking, or via g-protein coupled receptor (gpcr) occupancy by ligands such as bombesin or lysophosphatidic acid, or via ldl receptor occupancy. | Erroneous |
| laminin is a complex glycoprotein, consisting of three different polypeptide chains (alpha, beta, gamma), which are bound to each other by disulfide bonds into a cross-shaped molecule comprising one long and three short arms with globules at each end (by similarity). | Erroneous |
| psi is a plastocyanin-ferredoxin oxidoreductase, converting photonic excitation into a charge separation, which transfers an electron from the donor p700 chlorophyll pair to the spectroscopically characterized acceptors a0, a1, fx, fa and fb in turn (by similarity). | Erroneous |
| involved in protection of chromosomal dna from damage under nutrient-limited and oxidative stress conditions. | Inconsistent |
| belongs to the cold-shock domain (csd) family. | Too Many |
| p35415:prm; 1; ebi-86215, ebi-133215; | Erroneous |
| composed of 14 different subunits. | Possibly Erroneous |
| proteins that associate with the core dimer include three families of regulatory subunits b (the r2/b/pr55/b55, r3/b″/pr72/pr130/pr59 and r5/b′/b56 families), the 48 kda variable regulatory subunit, viral proteins, and cell signaling molecules (by similarity). | Inconsistent |
| type i restriction and modification enzymes are complex, multifunctional systems which require atp, s-adenosyl methionine and mg(2+) as cofactors and, in addition to their endonucleolytic and methylase activities, are potent dna-dependent atpases (by similarity). | Inconsistent |
| 3-beta-hydroxy-delta(5)-steroid + nad(+) = 3-oxo-delta(5)-steroid + nadh (acts on 3-beta-hydroxyandrost-5-en-17-one to form androst-4-ene-3,17-dione and on 3-beta-hydroxypregn -5-en-20-one to form progesterone). | Accurate |
| udp-n-acetyl-d-glucosamine + n-acetyl-beta-d-glucosaminyl-1,2-alpha-d-mannosyl-1,3(6)-(n-acetyl-beta-d-glucosaminyl-1,2-alpha-d-mannosyl,1,6(3))-beta-d-mannosyl-1,4-n-acetyl-beta-d-glucosaminyl-r = udp + n-acetyl-beta-d-glucosaminyl-1,2-(n-acetyl-beta-d-glucosaminyl-1,6)-1,2-alpha-d-mannosyl-1,3(6) -(n-acetyl-beta-d-glucosaminyl-1,2-alpha-d-mannosyl-1,6(3))-beta-d-mannosyl-1,4-n-acetyl-beta-d-glucosaminyl-r. | Erroneous |
| in e.coli rnase h participare in dna replication; it helps to specify the origin of genomic replication by suppressing initiation at origins other than the locus oric; along with the 5–′3éxonuclease of pol1, it removes rna primers from the okazaki fragments of lagging strand symthesis; and it defines the origin of replication for cole1-type plasmids by specific cleavage of an rna preprimer. | Inconsistent |
| thoracic aortic aneurysms and dissections are primarily associated with a characteristic histologic appearance known as m´ | Erroneous |
| component of the cleavage and polyadenylation specificity factor (cpsf) complex that play a key role in pre-mrna 3–′end formation, recognizing the aauaaa signal sequence and interacting with poly(a) polymerase and other factors to bring about cleavage and poly(a) addition (by similarity). | Inconsistent |
| there are two operons: the xylcab operon is responsible for the upper metabolic pathway from toluene to aromatic carboxylic acids, & the xyldlefg operon is required for the lower catabolic pathway from aromatic carboxylic acids to compounds that enter the trycarboxylic acid cycle. | Erroneous |
| hh is characterized by abnormal intestinal iron absorption and progressive increase of total body iron, which results in midlife in clinical complications including cirrhosis, cardiopathy, diabetes, endocrine dysfunctions, arthropathy, and susceptibility to liver cancer. | Inconsistent |
| prp is found in high quantity in the brain of humans and animals infected with the degenerative neurological diseases kuru, creutzfeldt-jacob disease (cjd), gerstmann-straussler syndrome (gss), scrapie, bovine spongiform encephalopathy (bse), etc. to other prp. | Accurate |
| involved in the atp-dependent selective degradation of cellular proteins, the maintenance of chromatin structure, the regulation of gene expression, the stress response, and ribosome biogenesis (by similarity). | Erroneous |
| coup (chicken ovalbumin upstream promoter) transcription factor binds to the ovalbumin promoter and, in cunjunction with another protein (s300-ii) stimulates initiation of transcription. | Inconsistent |
| the lys-124 ubiquitination also modulates the formation of double-strand breaks during meiosis and is a prerequisite for and dna-damage checkpoint activation (by similarity). | Erroneous |
| the export to cytoplasm depends on the interaction with a 14-3-3 chaperone protein and is due to its phosphorylation at ser-259 and ser-498 by camk (by similarity). | Erroneous |
| the sigma factor is an initiation factor that promotes attachment of the rna polymerase to specific initiation sites and then is released (by similarity). | Too Many |
| hydrolysis of 1,4-alpha-d-glucosidic linkages in polysaccharides so as to remove successive maltose units from the non-reducing ends of the chains. | Accurate |
| the resulting products may subsequently be converted to the corresponding alcohols that are incorporated into lignins (by similarity). | Erroneous |
| involved in the initial immune cell clustering during inflammatory response and may regulate chemotactic activity of chemokines. | Inconsistent |
| s-adenosyl-l-methionine + magnesium protoporphyrin = s-adenosyl-l-homocysteine + magnesium protoporphyrin monomethyl ester. | Erroneous |
| component of the coat surrounding the cytoplasmic face of coated vesicles located at the golgi complex (by similarity). | Accurate |
| hsp82 is an essential protein that is required by cells in higher concentrations for growth at higher temperatures. | Accurate |
| monoubiquitinated on lys-147; may give a specific tag for epigenetic transcriptional activation (by similarity). | Erroneous |
| probably a dodecamer composed of six biotin-containing alpha subunits and six beta subunits (by similarity). | Possibly Erroneous |
| organized into a structure (processome or rna degradosome) containing a number of rna-processing enzymes. | Inconsistent |
| involved in the formation of the nuclear envelope and of the transitional endoplasmic reticulum (ter). | Inconsistent |
| this methionine-rich region is probably important for copper tolerance in bacteria (by similarity). | Erroneous |
| they have identical ligand binding properties but different coupling properties with g proteins. | Possibly Erroneous |
| 3-carboxy-2-hydroxy-4-methylpentanoate + nad(+) = 3-carboxy-4-methyl-2- oxopentanoate + nadh. | Accurate |
| this is a conceptual translation; two frameshifts had to be introduced to produce this orf. | Erroneous |
| component of the infraciliary lattice (icl) and the ciliary basal bodies (by similarity). | Possibly Erroneous |
| catalyzes the methylation of c-11 in precorrin-4 to form precorrin-5 (by similarity). | Possibly Erroneous |
| on the 2d-gel the determined pi of this unknown protein is: 6.2, its mw is: 28 kda. | Accurate |
| heterodimer of a p110 (catalytic) and a p85 (regulatory) subunit (by similarity). | Accurate |
| this viral protein may be involved in the regulation of the complement cascade. | Inconsistent |
| two forms; long (shown here) and short; are produced by alternative splicing. | Inconsistent |
| assembles at the inner surface of the cytoplasmic membrane (by similarity). | Too Many |
| 1-aminocyclopropane-1-carboxylate + o2 = ethylene + hcn + co(2) + 2 h(2)o. | Accurate |
| bind preferentially single-stranded dna and unwind double stranded dna. | Inconsistent |
| involved in the regulation of hydrogenase expression (by similarity). | Erroneous |
| may have an essential function in lipopolysaccharides biosynthesis. | Erroneous |
| rch(2)nh(2) + h(2)o + acceptor = rcho + nh(3) + reduced acceptor. | Accurate |
| subunit 1 binds to the primer-template junction (by similarity). | Inconsistent |
| to immunoglobulin and major histocompatibility complex domain. | Too Many |
| isoform 3: membrane; multi-pass membrane protein (potential). | Possibly Erroneous |
| the beta subunit seems to be encoded by a multigene family. | Erroneous |
| atp + adenylylsulfate = adp + 3–′phosphoadenylylsulfate. | Inconsistent |
| an aryl sulfate + a phenol = a phenol + an aryl sulfate. | Erroneous |
| peptidyl-l-amino acid + h(2)o = peptide + l-amino acid. | Possibly Erroneous |
| in the c-terminus to yeast sla2 and c.elegans zk370.3. | Erroneous |
| mediates e2-dependent ubiquitination (by similarity). | Accurate |
| villin is a ca(2+)-regulated actin-binding protein. | Inconsistent |
| atp + undecaprenol = adp + undecaprenyl phosphate. | Accurate |
| aminoacyl-peptide + h(2)o = amino acid + peptide. | Inconsistent |
| to the calcitonin and to the secretin receptors. | Erroneous |
| heterodimer of an alpha chain and a beta chain. | Too Many |
| requires ca2+ and mn2+ ions for full activity. | Inconsistent |
| contains 1 immunoglobulin-like v-type domain. | Too Many |
| belongs to family 13 of glycosyl hydrolases. | Too Many |
| acts as a transglycosylase (by similarity). | Erroneous |
| nuclear effector molecule (by similarity). | Possibly Erroneous |
| involved in carbon catabolite repression. | Erroneous |
| q9vy42:cg1461; 1; ebi-194476, ebi-127720; | Erroneous |
| contains 6 ldl-receptor class b domains. | Erroneous |
| ring cleavage of 2,3-dihydroxybiphenyl. | Possibly Erroneous |
| not expected to have protease activity. | Accurate |
| secreted in hemolymph (by similarity). | Accurate |
| interacts with rad51 (by similarity). | Accurate |
| endplasmic reticulum membrane bound. | Accurate |
| associated with the plasma membrane. | Accurate |
| does not have a catalytic activity. | Possibly Erroneous |
| belongs to the eae/invasin family. | Erroneous |
| interacts with cyclin g in vitro. | Possibly Erroneous |
| self; 1; ebi-190958, ebi-190958; | Possibly Erroneous |
| binds 1 nickel ion per monomer. | Accurate |
| binds 1 magnesium per subunit. | Inconsistent |
| clavulanic acid biosynthesis. | Accurate |
| belongs to the ycf50 family. | Accurate |
| inhibited by acetazolamide. | Erroneous |
| involved in tumorigenesis. | Accurate |
| acetyltransferase enzyme. | Possibly Erroneous |
| phosphorylates ppp1r12a. | Possibly Erroneous |
| detected at low levels. | Accurate |
| interacts with trim28. | Accurate |
| contacts protein l19. | Erroneous |
| interacts with gcn5. | Accurate |
| may self-associate. | Accurate |
| secreted in milk. | Too Many |
| heme-thiolate. | Accurate |
| adipocytes. | Accurate |
| nadp. | Accurate |
| nuclear. | Too Many |
| p. | Too Many |
| 25. | Too Many |
| 1. | Too Many |
| 3. | Too Many |
| 2. | Too Many |
| venom. | Inconsistent |
| roots. | Inconsistent |
| leaf. | Inconsistent |
All of the sentences analysed, and their corresponding classifications. Sentences have been stored in lowercase to allow for case-insensitive comparison. For further information, including the entries affected by these sentences, please see the authors' website.
Sentence classification results.
| Classification | Erroneous | Inconsistent | Accurate | Too Many | Possibly Erroneous |
| Absolute | 36 | 29 | 28 | 15 | 14 |
| Percentage | 29.5% | 23.8% | 23.0% | 12.3% | 11.5% |
| Potentially Erroneous | 2465 | 1986 | 1918 | 1027 | 959 |
The classification results for all of the analysed sentences (122 in total).
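The percentage and bottom rows of the table above can be reproduced from the absolute counts. The sketch below assumes the bottom row scales each sampled fraction up to the 8355 Missing Origin sentences reported in the pattern table. That reading is an inference from the numbers agreeing, not something the captions state, and the function name is hypothetical.

```python
def summarise(counts, population):
    """counts: classification -> number of sampled sentences.

    Returns classification -> (percentage of sample, count scaled to population).
    """
    sample_size = sum(counts.values())
    return {
        label: (round(100 * n / sample_size, 1), round(n * population / sample_size))
        for label, n in counts.items()
    }

all_sentences = {"Erroneous": 36, "Inconsistent": 29, "Accurate": 28,
                 "Too Many": 15, "Possibly Erroneous": 14}
summary = summarise(all_sentences, population=8355)  # 8355: assumed scaling base
```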
Classification of sentences over 20 characters in length.
| Classification | Erroneous | Inconsistent | Accurate | Too Many | Possibly Erroneous |
| Absolute | 16 | 11 | 20 | 5 | 13 |
| Percentage | 24.6% | 16.9% | 30.8% | 7.7% | 20.0% |
| Potentially Erroneous | 2057 | 1414 | 2571 | 643 | 1671 |
The classification results of the 65 sentences analysed, controlling for sentence length bias (i.e. every 100th sentence over 20 characters in length).
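Selecting every 100th sentence over 20 characters is a form of systematic sampling, sketched below. The caption does not say how the sentences were ordered or where counting started, so the ordering of the input and the starting offset here are assumptions, and the function name is hypothetical.

```python
def systematic_sample(sentences, min_length=20, step=100):
    """Keep sentences longer than min_length characters, then take every step-th one."""
    long_enough = [s for s in sentences if len(s) > min_length]
    # Start at the step-th item (index step - 1); this offset is an assumption.
    return long_enough[step - 1::step]
```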
Classification of sentences in UniProtKB 2012_05.
| Classification | Erroneous | Inconsistent | Accurate | Too Many | Possibly Erroneous |
| Absolute | 4 | 5 | 12 | 1 | 10 |
| Percentage | 12.5% | 15.6% | 37.5% | 3.1% | 31.3% |
| Potentially Erroneous | 479 | 599 | 1438 | 120 | 1198 |
The classification results for the subset of analysed sentences (controlling for sentence length bias) that remain in UniProtKB Version 2012_05.