Christopher S Funk, K Bretonnel Cohen, Lawrence E Hunter, Karin M Verspoor.
Abstract
BACKGROUND: Gene Ontology (GO) terms represent the standard for annotation and representation of molecular functions, biological processes and cellular compartments, but a large gap exists between the way concepts are represented in the ontology and how they are expressed in natural language text. The construction of highly specific GO terms is formulaic, consisting of parts and pieces from simpler terms.
Keywords: Biomedical concept recognition; Gene ontology; Named entity recognition; Text-mining
Year: 2016 PMID: 27613112 PMCID: PMC5018193 DOI: 10.1186/s13326-016-0096-7
Source DB: PubMed Journal: J Biomed Semantics
Fig. 1 Finite state automata representing activation, proliferation, and differentiation GO terms. An abstracted FSA, adapted from a figure in Ogren et al. [20], that shows how a particular term can be decomposed into its smaller components, where “cell type” can be any specific type of cell
Example ontology entry for the concept “membrane budding”
| id: GO:0006900 |
| name: membrane budding |
| namespace: biological_process |
| def: "The evagination of a membrane resulting in formation of a vesicle." |
| synonym: "membrane evagination" EXACT |
| synonym: "nonselective vesicle assembly" RELATED |
| synonym: "vesicle biosynthesis" EXACT |
| synonym: "vesicle formation" EXACT |
| is_a: GO:0016044 ! membrane organization and biosynthesis |
| relationship: part_of GO:0016192 ! vesicle-mediated transport |
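The entry above follows a simple `tag: value` layout, so the name and exact synonyms that a dictionary-based concept recognizer would match against can be pulled out with a few lines of code. The following is a minimal sketch, not the tooling used in the paper; it assumes quoted fields are delimited by double quotes, and the helper names `parse_stanza` and `exact_synonyms` are hypothetical.

```python
from collections import defaultdict

# The "membrane budding" entry, written with double-quoted fields (an assumption).
STANZA = """\
id: GO:0006900
name: membrane budding
namespace: biological_process
def: "The evagination of a membrane resulting in formation of a vesicle."
synonym: "membrane evagination" EXACT
synonym: "nonselective vesicle assembly" RELATED
synonym: "vesicle biosynthesis" EXACT
synonym: "vesicle formation" EXACT
is_a: GO:0016044 ! membrane organization and biosynthesis
relationship: part_of GO:0016192 ! vesicle-mediated transport
"""


def parse_stanza(text):
    """Collect `tag: value` lines into a dict of lists (tags may repeat)."""
    entry = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        tag, _, value = line.partition(": ")
        entry[tag].append(value.strip())
    return dict(entry)


def exact_synonyms(entry):
    """Return the quoted text of every synonym whose scope is EXACT."""
    found = []
    for raw in entry.get("synonym", []):
        text, _, scope = raw.rpartition('" ')
        if scope == "EXACT":
            found.append(text.lstrip('"'))
    return found


entry = parse_stanza(STANZA)
labels = entry["name"] + exact_synonyms(entry)
print(labels)
```

Only four labels are available for matching here, which illustrates the gap the abstract describes: none of them covers a phrasing that departs from the listed names.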
Examples of the “membrane budding” concept within a single document
| Lipid rafts play a key role in |
| Having excluded a direct role in |
| …involvement of annexin A7 in |
| …Ca2+-mediated |
| Red blood cells which lack the ability to |
Recursive syntactic rules order, constituent terms, and example generated synonyms
| Order | Rule | GO term | Constituent terms | Generated synonyms |
|---|---|---|---|---|
| 1 | “via” or “involved in” terms | GO:0002679 - respiratory burst involved in defense response | “respiratory burst”, “defense response” | “defense response associated respiratory burst” |
| 2 | “regulation of” terms | GO:0030513 - positive regulation of BMP signaling pathway | “BMP signaling pathway” | “positive regulation of BMP receptor pathway”, “up-regulation of BMP receptor signaling” |
| 3 | “response to” terms | GO:0034263 - autophagy in response to ER overload | “autophagy”, “ER overload” | “ER overload responsible for autophagy”, “autophagy response to ER overload” |
| 4 | “signaling” terms | GO:0035329 - hippo signaling | “hippo” | “hippo signaling pathway”, “signaling of hippo” |
| 5 | “biosynthetic process” terms | GO:0042095 - interferon-gamma biosynthetic process | “interferon-gamma” | “interferon-gamma biosynthesis”, “production of interferon-gamma” |
| 6 | “metabolic process” terms | GO:0042120 - alginic acid metabolic process | “alginic acid” | “metabolism of alginic acid”, “alginic acid metabolism” |
| 7 | “catabolic process” terms | GO:0042190 - vanillin catabolic process | “vanillin” | “vanillin degradation”, “breakdown of vanillin” |
| 8 | “binding” terms | GO:0042314 - bacteriochlorophyll binding | “bacteriochlorophyll” | “binding of bacteriochlorophyll”, “bacteriochlorophyll bound” |
| 9 | “transport” terms | GO:0042876 - aldarate transmembrane transporter activity | “aldarate”, “transmembrane” | “transportation of aldarate across the membrane”, “transporting aldarate transmembrane” |
| 10 | “differentiation” terms | GO:0043158 - heterocyst differentiation | “heterocyst” | “heterocyst cell differentiation”, “differentiation into heterocyst” |
| 11 | “activity” terms | GO:0043492 - ATPase activity, coupled to movement of substances | “ATPase”, “coupled to movement of substances” | “ATPase, coupled to movement of substances”, “coupled to movement of substances activity of ATPase” |
While these examples show only one rule applied at a time, each constituent term is itself passed recursively through every rule, in the order outlined, until the most basic constituent terms are found. These basic terms receive derivational variants (discussed in the next paragraph) and are then combinatorially recombined into generated synonyms of the original term.
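The recursive decomposition described in this note can be sketched as an ordered list of patterns that is re-applied to every constituent until no pattern matches. The regular expressions below are simplified, hypothetical stand-ins for a handful of the eleven rules; they are not the rules as implemented by the authors.

```python
import re

# Hypothetical, simplified patterns for a subset of the rules, in table order.
RULES = [
    ("via / involved in", re.compile(r"^(?P<a>.+) (?:involved in|via) (?P<b>.+)$")),
    ("regulation of", re.compile(r"^(?:positive |negative )?regulation of (?P<a>.+)$")),
    ("response to", re.compile(r"^(?P<a>.+) in response to (?P<b>.+)$")),
    ("signaling", re.compile(r"^(?P<a>.+) signaling$")),
    ("biosynthetic process", re.compile(r"^(?P<a>.+) biosynthetic process$")),
    ("metabolic process", re.compile(r"^(?P<a>.+) metabolic process$")),
    ("catabolic process", re.compile(r"^(?P<a>.+) catabolic process$")),
]


def decompose(term):
    """Return the most basic constituent terms of `term`, left to right."""
    for _name, pattern in RULES:
        match = pattern.match(term)
        if match:
            parts = [p for p in match.groups() if p]
            # Each constituent goes through every rule again, in order.
            return [basic for part in parts for basic in decompose(part)]
    return [term]  # no rule applies: the term is already basic


print(decompose("respiratory burst involved in defense response"))
print(decompose("autophagy in response to ER overload"))
# A nested input chosen for illustration: two rules fire in sequence.
print(decompose("positive regulation of interferon-gamma biosynthetic process"))
```

Trying the rules in a fixed order matters: the outermost construction (for example “involved in”) is split off first, so the inner pieces can be handled by the more specific rules further down the list.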
Individual derivational variant generation rules
| Order | Rule | Rule defined | GO terms | Example derivations |
|---|---|---|---|---|
| 1 | Single word terms | 1 {NN} ⇒ {JJ} | 1 GO:0043066 - negative regulation of | 1 “ |
| 2 {NN} ⇒ {VB} | 2 GO:0023040 - | 2 “ | ||
| 2 | Double word terms | 1 {NN_1 NN_2} ⇒ {NN_1}, {VB_2 NN_1}, {JJ_1 NN_2}, {NN_1 JJ_2} | 1 GO:0048666 - | 1 “ |
| 2 {JJ_1 NN_2} ⇒ {JJ_1}, {JJ_1 JJ_2} | 2 GO:0005576 - | “ | ||
| 3 | Triple word terms | 1 {NN_1 NN_2 NN_3} ⇒ {NN_1 NN_3}, {NN_3 NN_1}, {VB_3} | 1 GO:0052386 - cell wall | 1 “ |
| 4 | “cell part” terms | Introduce and re-order cell part terms | GO:0035452 - | “ |
| 5 | “sensory perception” terms | Introduce variants of the sense - “sensory perception of {NN}” | GO:0050909 - sensory perception of | “ |
| 6 | “transcription, | Introduce variants of “transcription” | GO:0006410 - | “RNA-dependent |
| 7 | “ | Introduce variants of “annealing” | GO:0033592 - RNA strand | “RNA |
The seven patterns from which we generate derivational variants are presented along with examples of each. While these are presented individually, all derivational and recursive syntactic rules (presented in Table 3) interact at each step. The examples provided are single GO terms, but any of the constituent terms produced through the above steps will go through all derivational rules, if possible. The bolded words in the GO terms and Example derivations columns represent the impact of the rule. The Penn Treebank part-of-speech (POS) tags are used as follows: NN = noun, VB = verb, JJ = adjective. All varying forms were converted to the basic POS tag, e.g. NNS (plural noun) was converted to NN.
Fig. 2 Three steps of synonym generation applied. A single GO concept broken down into its composite parts (bolded and underlined), synonyms generated for each part (text underneath the part), then combination of all synonyms from all composite parts to form complete synonyms of the original concept
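The third step in the figure, recombining the variants of every part, amounts to a Cartesian product over the per-part variant lists. The sketch below illustrates that step only; the variant lists and the single template are hand-written for illustration and are not output of the authors' system, and the function name `combine` is hypothetical.

```python
from itertools import product


def combine(part_variants, templates):
    """Fill every template with every combination of per-part variants."""
    results = []
    for template in templates:
        for combo in product(*part_variants):
            phrase = template.format(*combo)
            if phrase not in results:  # keep first occurrence, drop duplicates
                results.append(phrase)
    return results


# Illustrative inputs (assumptions): one list of variants per composite part.
part_variants = [
    ["positive regulation", "up-regulation"],
    ["cell proliferation", "cellular proliferation"],
]
templates = ["{0} of {1}"]

synonyms = combine(part_variants, templates)
for phrase in synonyms:
    print(phrase)
```

Because the number of combinations is the product of the list lengths, a term with several parts and several variants per part quickly yields many synonyms, which is consistent with the large generated-synonym counts reported for Biological Process terms.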
Micro-averaged results for each synonym generation method on the CRAFT corpus
| Method | TP | FP | FN | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|
| Baseline (B1) | 10,778 | 6,280 | 18,669 | 0.632 | 0.366 | 0.464 |
| Baseline (B2) | 12,217 | 7,367 | 17,230 | 0.624 | 0.415 | 0.498 |
| All external synonyms | 12,747 | 11,682 | 16,704 | 0.522 | 0.433 | 0.473 |
| Recursive syntactic rules | 12,411 | 7,587 | 17,036 | 0.621 | 0.422 | 0.502 |
| Recursive syntactic and derivational rules | 18,611 | 10,507 | 10,836 | 0.639 | 0.632 | **0.636** |
Bold highlighting indicates the method that produces the highest F-measure
Performance of manual Gene Ontology rules on the CRAFT corpus
| Method | Generated synonyms | Affected terms | TP | FP | FN | P | R | F |
|---|---|---|---|---|---|---|---|---|
| Cellular Component (CC) | ||||||||
| Baseline (B1) | X | X | 5,532 | 452 | 2,822 | 0.925 | 0.662 | 0.772 |
| Baseline (B2) | X | X | 5,532 | 452 | 2,822 | 0.925 | 0.662 | 0.772 |
| Syntactic recursion rules | 23 | 21 | 5,532 | 452 | 2,822 | 0.925 | 0.662 | **0.772** |
| Both rules | 4,083 | 724 | 6,585 | 969 | 1,769 | 0.872 | 0.788 | **0.828** |
| Molecular Function (MF) | ||||||||
| Baseline (B1) | X | X | 337 | 146 | 3,843 | 0.698 | 0.081 | 0.145 |
| Baseline (B2) | X | X | 1,772 | 964 | 2,408 | 0.648 | 0.424 | 0.512 |
| Syntactic recursion rules | 11,637 | 7,353 | 1,759 | 977 | 2,421 | 0.643 | 0.421 | 0.509 |
| Both rules | 14,413 | 7,401 | 2,422 | 1,074 | 1,758 | 0.693 | 0.579 | **0.631** |
| Biological Process (BP) | ||||||||
| Baseline (B1) | X | X | 4,909 | 5,682 | 12,004 | 0.464 | 0.290 | 0.357 |
| Baseline (B2) | X | X | 4,913 | 5,951 | 12,000 | 0.452 | 0.291 | 0.354 |
| Syntactic recursion rules | 182,617 | 6,847 | 5,120 | 6,158 | 11,793 | 0.454 | 0.303 | **0.363** |
| Both rules | 272,535 | 8,675 | 9,604 | 8,464 | 7,309 | 0.532 | 0.568 | **0.549** |
| All Gene Ontology | ||||||||
| Baseline (B1) | X | X | 10,778 | 6,280 | 18,669 | 0.632 | 0.366 | 0.464 |
| Baseline (B2) | X | X | 12,217 | 7,367 | 17,230 | 0.624 | 0.415 | 0.498 |
| Syntactic recursion rules | 194,277 | 14,221 | 12,411 | 7,588 | 17,036 | 0.621 | 0.422 | **0.502** |
| Both rules | 291,031 | 16,800 | 18,611 | 10,507 | 10,836 | 0.640 | 0.632 | **0.636** |
Bold highlighting indicates where the generated synonyms have a positive effect on the performance
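The P, R and F columns follow from the TP, FP and FN counts, and the “All Gene Ontology” rows are micro-averages: counts are pooled across the three branches before the scores are computed. The sketch below uses the standard definitions of precision, recall and F-measure (assumed here to be the balanced F1) and the Baseline (B1) counts from the table; the function name `prf` is hypothetical.

```python
def prf(tp, fp, fn):
    """Precision, recall and balanced F-measure from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Baseline (B1) counts per branch, copied from the table: (TP, FP, FN).
b1_counts = {
    "CC": (5532, 452, 2822),
    "MF": (337, 146, 3843),
    "BP": (4909, 5682, 12004),
}

# Micro-averaging: sum the counts first, then compute the scores once.
pooled = tuple(sum(column) for column in zip(*b1_counts.values()))
p, r, f = prf(*pooled)

print(pooled)                        # (10778, 6280, 18669)
print(f"{p:.3f} {r:.3f} {f:.3f}")    # 0.632 0.366 0.464
```

Pooling means that branches with many annotations dominate the overall score, so the All Gene Ontology figures sit closer to the Cellular Component and Biological Process rows than to the Molecular Function rows.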
The top 5 derivational synonyms that improve performance on the CRAFT corpus
| GO ID | Term name | ΔTP | ΔFP | ΔFN | Generated synonyms |
|---|---|---|---|---|---|
| Cellular Component | |||||
| GO:0019814 | Immunoglobulin complex | +548 | +0 | −548 | Antibody, antibodies |
| GO:0005634 | Nucleus | +218 | +35 | −218 | Nuclear, nucleated |
| GO:0005739 | Mitochondrion | +135 | +0 | −135 | Mitochondrial |
| GO:0031982 | Vesicle | +11 | +3 | −11 | Vesicular |
| GO:0005856 | Cytoskeleton | +15 | +0 | −15 | Cytoskeletal |
| Molecular Function | |||||
| GO:0000739 | DNA strand annealing activity | +327 | +1 | −327 | Hybridized, hybridization, annealing, annealed |
| GO:0033592 | RNA strand annealing activity | +327 | +1 | −327 | Hybridized, hybridization, annealing, annealed |
| GO:0031386 | Protein tag | +6 | +79 | −6 | Tag |
| GO:0005179 | Hormone activity | +1 | +0 | −1 | Hormonal |
| GO:0043495 | Protein anchor | +1 | +10 | −1 | Anchor |
| Biological Process | |||||
| GO:0010467 | Gene expression | +2235 | +361 | −2235 | Expression, expressed, expressing |
| GO:0007608 | Sensory perception of smell | +445 | +1 | −445 | Olfactory |
| GO:0008283 | Cell proliferation | +97 | +71 | −97 | Cellular proliferation, proliferative |
| GO:0007126 | Meiosis | +93 | +2 | −93 | Meiotic, meiotically |
| GO:0006915 | Apoptosis | +173 | +2 | −173 | Apoptotic |
The GO terms that increase performance the most on CRAFT are shown along with the change (Δ) in number of true positives (TP), false positives (FP), and false negatives (FN) from the baseline B2 (“activity” removed baseline). The generated synonyms that result in this increase are shown under ‘Generated synonyms’
Statistics of annotations produced on the large literature collection by information content
| IC | Baseline B2: # terms | Baseline B2: # annotations | With generated synonyms: # terms | With generated synonyms: # annotations | New concepts | New annotations | Change |
|---|---|---|---|---|---|---|---|
| Undefined | 3,548 | 16,929,911 | 4,303 | 23,653,066 | 755 | 6,723,155 | +39.7 % |
| [0,1) | 7 | 3,202,114 | 7 | 3,177,333 | 0 | −24,781 | −0.1 % |
| [1,2) | 16 | 2,655,365 | 17 | 2,801,431 | 1 | 146,066 | +0.1 % |
| [2,3) | 43 | 7,332,003 | 44 | 8,016,573 | 1 | 684,570 | +0.1 % |
| [3,4) | 94 | 4,474,422 | 101 | 5,188,968 | 7 | 714,546 | +0.2 % |
| [4,5) | 178 | 4,185,438 | 191 | 9,340,757 | 13 | 5,155,319 | +123.8 % |
| [5,6) | 354 | 13,547,423 | 373 | 22,284,670 | 19 | 8,737,247 | +64.4 % |
| [6,7) | 666 | 9,533,940 | 715 | 12,060,499 | 49 | 2,526,559 | +26.3 % |
| [7,8) | 1,044 | 18,354,299 | 1,154 | 21,251,834 | 110 | 2,897,535 | +16.8 % |
| [8,9) | 1,465 | 7,932,937 | 1,648 | 15,316,476 | 183 | 7,383,539 | +92.4 % |
| [9,10) | 1,551 | 4,813,153 | 1,813 | 7,671,601 | 262 | 2,858,448 | +58.3 % |
| [10,11) | 1,396 | 2,390,061 | 1,690 | 4,291,831 | 294 | 1,901,770 | +79.1 % |
| [11,12) | 942 | 1,246,758 | 1,162 | 2,279,005 | 220 | 1,032,247 | +83.3 % |
| [12,13) | 732 | 578,501 | 953 | 1,257,956 | 221 | 679,455 | +117.2 % |
| Total | 12,036 | 97,176,325 | 14,171 | 138,592,000 | 2,135 | 41,415,675 | +42.5 % |
The table shows the number of unique terms and the total number of annotations produced by baseline B2 and by applying both the derivational and recursive syntactic rules, along with the overall impact of the rules. Change is the percent change in total annotations
Results of manual inspection of random samples of annotations
| IC | Baseline B2: # terms | Baseline B2: # annotations | Baseline B2: accuracy | With rules: # terms | With rules: # annotations | With rules: accuracy | Overall accuracy |
|---|---|---|---|---|---|---|---|
| Undefined | 35 | 231 | 0.98 | 75 | 363 | 0.70 | 0.81 |
| [0,1) | 1 | 15 | 0.20 | 0 | 0 | 0.00 | 0.20 |
| [1,2) | 1 | 15 | 1.00 | 1 | 4 | 1.00 | 1.00 |
| [2,3) | 1 | 15 | 1.00 | 1 | 4 | 1.00 | 1.00 |
| [3,4) | 1 | 4 | 1.00 | 1 | 1 | 0.00 | 0.80 |
| [4,5) | 2 | 30 | 0.60 | 2 | 24 | 0.88 | 0.72 |
| [5,6) | 4 | 60 | 0.97 | 2 | 13 | 0.23 | 0.84 |
| [6,7) | 7 | 79 | 0.99 | 5 | 41 | 0.49 | 0.82 |
| [7,8) | 10 | 136 | 0.89 | 11 | 116 | 0.65 | 0.78 |
| [8,9) | 15 | 197 | 0.98 | 19 | 163 | 0.83 | 0.91 |
| [9,10) | 16 | 175 | 0.97 | 26 | 205 | 0.79 | 0.87 |
| [10,11) | 14 | 119 | 0.83 | 30 | 217 | 0.80 | 0.81 |
| [11,12) | 10 | 103 | 0.97 | 22 | 141 | 0.77 | 0.86 |
| [12,13) | 8 | 93 | 0.98 | 22 | 156 | 0.72 | 0.82 |
| Total | 125 | 1,272 | 0.94 | 217 | 1,448 | 0.74 | 0.83 |
Accuracy, calculated via manual review of textual annotations for correctness, of random subsets of concepts recognized from the large literature collections. We sampled 1 % of concepts, with up to 15 randomly sampled specific text spans per concept, from concepts identified using baseline B2. We sampled 10 % of concepts, with up to 15 randomly sampled text spans per concept, from the new concepts recognized through the presented synonym generation rules. Overall accuracy is calculated by combining annotations of the same IC from baseline and with our rules
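The caption does not spell out how the two samples are combined. One reading that is consistent with the reported figures is an annotation-weighted average of the two accuracies within each IC bin; the sketch below checks that reading against three rows of the table. The function name `overall_accuracy` is hypothetical.

```python
def overall_accuracy(samples):
    """Annotation-weighted accuracy over (n_annotations, accuracy) pairs."""
    total = sum(n for n, _ in samples)
    correct = sum(n * accuracy for n, accuracy in samples)
    return correct / total


# (baseline B2 sample, with-rules sample) taken from the table rows.
undefined_row = [(231, 0.98), (363, 0.70)]
bin_5_6_row = [(60, 0.97), (13, 0.23)]
total_row = [(1272, 0.94), (1448, 0.74)]

print(round(overall_accuracy(undefined_row), 2))  # 0.81
print(round(overall_accuracy(bin_5_6_row), 2))    # 0.84
print(round(overall_accuracy(total_row), 2))      # 0.83
```

Under this reading, a bin such as [5,6) keeps a high overall accuracy despite the low accuracy of its new annotations, because the baseline sample contributes far more annotations to the average.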