Jin-Dong Kim, Tomoko Ohta, Jun'ichi Tsujii.
Abstract
BACKGROUND: Advanced Text Mining (TM) tasks such as semantic enrichment of papers, event or relation extraction, and intelligent Question Answering have increasingly attracted attention in the bio-medical domain. For such attempts to succeed, text annotation from the biological point of view is indispensable. However, due to the complexity of the task, semantic annotation has never been tried on a large scale, apart from relatively simple term annotation.
Year: 2008 PMID: 18182099 PMCID: PMC2267702 DOI: 10.1186/1471-2105-9-10
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. GENIA term ontology. The hierarchy of the GENIA term ontology. Terminal classes are used for GENIA term annotation. The figures in parentheses indicate the number of annotation instances made to the GENIA corpus.
Figure 2. Example of event annotation. GENIA event annotation is made sentence by sentence. Although the actual corpus file with annotation is encoded in XML (C), the annotators work on a CSS-styled view (A), which is much more user-friendly. Sometimes, a graphical representation (B) is used to depict annotated events and their relations in an abstract and concise way. Note that the black, red and blue arcs link an event with its themes, causes and location, respectively.
Figure 3. GENIA event ontology. The hierarchy of the GENIA event ontology. For event annotation, not only terminal classes but also higher-level classes may be used. The figures in parentheses indicate the number of annotation instances made to the GENIA corpus.
Figure 4. Graphical representation of events in some example sentences. Examples in text with corresponding event annotation in graphical representation. (A) T-cell expression of the human GATA-3 gene is regulated by a non-lineage-specific silencer. (B) The extent of IFN-induced NK cell killing of E1A-expressing cells was proportional to the level of E1A expression ... (C) Cell hemoglobinization was accompanied by the increased expression of genes encoding gamma-globin ... (D) In addition, forced expression of GATA3 potentiated the induction of RALDH2 by TAL1 and LMO, and these three factors formed a complex in vivo.
Figure 5. SBML-style event description for the example in Figure 2. The nodes denote biological entities. The links denote transitions between different states of entities and correspond to events causing the state transitions.
Figure 6. Graph representations of events about "LMP1 to activate NF-kappa B". (A) expresses the event "LMP1 activates NF-kappa B", and (B) expresses the event "expression of LMP1 activates NF-kappa B". The biological implication of the two expressions is equivalent: since LMP1 activates NF-kappa B, the physical manifestation of LMP1 also activates NF-kappa B.
Figure 7. Molecular interactions and signaling pathways engaged by LMP1. LMP1 is involved in the activation of NFkB. Although LMP1 affects the activation of NFkB only through a complex pathway, natural language text often states this involvement simply as "LMP1 activates NFkB." Reprinted from [68], Copyright 2001, with permission from Elsevier.
Linguistic realization of the word "inhibit" in various contexts
Distribution of theme classes for Transcription, Translation, Gene_expression and Binding events
| Transcription | Translation | Gene_expression | Binding |
|---|---|---|---|
| DNA (538) | Protein (34) | Protein (2,569) | Protein DNA (1,186) |
| RNA (334) | DNA (7) | DNA (904) | Protein Protein (611) |
| Protein (291) | Virus (1) | Virus (47) | Protein (288) |
| Virus (38) | | RNA (30) | DNA (77) |
| *No theme (16) | | Peptide (4) | Other_organic_compound Protein (58) |
| | | | Protein Lipid (48) |
| | | | DNA DNA (31) |
| | | | Polynucleotide Protein (22) |
| | | | Protein Protein Protein (10) |
| | | | DNA Protein Protein (8) |
| | | | ... |
The lists for the event classes Transcription, Translation and Gene_expression are complete. For the event class Binding, the 10 most frequent theme patterns are shown. Note that Binding events may be annotated with more than one theme.
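A distribution of this kind can be tallied by treating the classes of all themes of one event as a single pattern. The following is only a minimal sketch under stated assumptions: the toy `events` list, the function names, and the choice to keep theme classes in the order given are hypothetical and not taken from the GENIA tools or corpus.

```python
from collections import Counter

def theme_pattern(theme_classes):
    """Join the theme classes of one event into a single pattern label.

    Assumption: classes are kept in the order given; an event without
    themes is labelled "*No theme", as in the table above.
    """
    return " ".join(theme_classes) if theme_classes else "*No theme"

def tally(events):
    """Count theme patterns separately for each event class."""
    counts = {}
    for event_class, theme_classes in events:
        counts.setdefault(event_class, Counter())[theme_pattern(theme_classes)] += 1
    return counts

# Hypothetical toy input: (event class, classes of the event's themes).
events = [
    ("Binding", ["Protein", "DNA"]),
    ("Binding", ["Protein", "Protein"]),
    ("Binding", ["Protein", "DNA"]),
    ("Transcription", ["DNA"]),
    ("Transcription", []),
]

counts = tally(events)
for event_class, patterns in counts.items():
    print(event_class, patterns.most_common())
```

Because a Binding event may carry several themes, its pattern can contain several classes, whereas the other three event classes always yield single-class patterns here.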
Clue expressions for some event classes
| Regulation | Binding | Localization |
|---|---|---|
| regulation [of] (178) | binding (256) | translocation [of] (88) |
| involved [in] (139) | binding [to] (123) | translocation (58) |
| effects [of] [on] (137) | binding [of] [to] (120) | secretion (57) |
| role [of] [in] (124) | binding [of] (114) | release (49) |
| dependent (106) | bind [to] (106) | secretion [of] (33) |
| regulated [by] (102) | bind (84) | secreted (23) |
| regulate (101) | binds [to] (83) | release [of] (23) |
| effect [of] [on] (98) | bound [to] (68) | mobilization (16) |
| affect (94) | binding activity (67) | localization [of] (16) |
| regulating (75) | interact [with] (57) | present (13) |
| effect [on] (72) | binds (48) | uptake (12) |
| regulated (66) | bound (44) | import [of] (12) |
| regulation [of] [by] (64) | interacts [with] (42) | released (11) |
| regulates (61) | associated [with] (37) | localization (10) |
| control (50) | interaction [of] [with] (34) | appearance [of] (9) |
| affected [by] (50) | cross-linking (34) | secreting (8) |
| controlled [by] (47) | interaction [with] (29) | mobilization [of] (8) |
| control [of] (45) | interaction (22) | localized (8) |
| regulation (40) | binding activity [of] (22) | uptake [of] (7) |
| plays * role [in] (35) | ligation (21) | translocated (7) |
| affected (34) | binding [for] (20) | translocate (5) |
| transcriptional regulation [of] (33) | interactions (19) | mobilized (5) |
| response [to] (33) | recognized [by] (17) | import (5) |
| effects [on] (33) | engagement (16) | distributed (5) |
| dependent [on] (32) | cross-linking [of] (15) | co-localization [with] (5) |
| play * role [in] (31) | association [with] (15) | translocates [as a result of] (4) |
| involvement [of] [in] (31) | bound [by] (14) | translocates (4) |
| modulating (30) | recognizes (12) | secrete (4) |
| responsible [for] (28) | interacted [with] (11) | migrating (4) |
| effect (28) | interactions [with] (10) | accumulation [of] (4) |
| changes [in] (27) | binding [by] (10) | shuttling [of] (3) |
| controls (25) | association [of] [with] (10) | sequestered [via] (3) |
| role [for] [in] (24) | associates [with] (10) | presence (3) |
| are key regulators [of] (24) | complexed [with] (9) | imported (3) |
| modulate (23) | ligation [of] (8) | expression [of] (3) |
| independent (22) | engagement [of] (8) | delivery [of] (3) |
| role [of] (21) | binding activities [of] (8) | topography (2) |
| regulators [of] (21) | associate [with] (8) | stored (2) |
| controlling (21) | linked [to] (7) | sequestered [by] (2) |
| affecting (21) | interaction [between] (7) | sequester (2) |
| ... | ... | ... |
The 40 most frequently observed clue expressions for each of three event classes: Regulation, Binding and Localization. The asterisk (*) indicates a discontinuity at that position. Words in square brackets are function words that appear together with clue expressions, forming linguistic patterns that connect the clue expressions to their themes and causes.
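The pattern notation used in this table can be read mechanically. The sketch below is illustrative only and is not part of the GENIA tooling: the function names are hypothetical, and it assumes that an asterisk stands for one or more intervening words and that matching is case-insensitive.

```python
import re

def parse_clue(pattern):
    """Split a pattern such as 'plays * role [in]' into clue tokens
    and bracketed function words (which may span several words)."""
    clue, function_words = [], []
    for tok in re.findall(r"\[[^\]]+\]|\S+", pattern):
        if tok.startswith("["):
            function_words.append(tok[1:-1])
        else:
            clue.append(tok)
    return clue, function_words

def clue_regex(pattern):
    """Build a regex for the clue part only.

    Assumption: '*' matches one or more intervening words (lazily).
    """
    clue, _ = parse_clue(pattern)
    parts = [r"\S+(?:\s+\S+)*?" if tok == "*" else re.escape(tok) for tok in clue]
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)

sentence = "NF-kappa B plays a key role in T-cell activation"
match = clue_regex("plays * role [in]").search(sentence)
print(match.group(0))
print(parse_clue("translocates [as a result of]"))
```

The function words are returned separately because, as the note above states, they link the clue to its themes and causes rather than forming part of the clue itself.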
Distribution of theme classes for Regulation events
| Regulation | Positive_regulation | Negative_regulation |
|---|---|---|
| Positive_regulation (702) | Protein (2,413) | Positive_regulation (1,505) |
| Gene_expression (586) | Gene_expression (1,560) | Protein (595) |
| Protein (453) | Positive_regulation (1,499) | Gene_expression (465) |
| DNA (426) | DNA (902) | Binding (269) |
| Transcription (239) | Transcription (632) | DNA (187) |
| Regulation (237) | Binding (446) | Transcription (164) |
| *No theme (192) | Negative_regulation (356) | Regulation (126) |
| Binding (133) | Cellular_phy_process (345) | Localization (126) |
| Physiological_process (120) | Localization (341) | Cellular_phy_process (122) |
| Negative_regulation (108) | Regulation (309) | Negative_regulation (121) |
| Cell_differentiation (106) | *No theme (277) | Viral_life_cycle (94) |
| Cellular_phy_process (95) | RNA (268) | *No theme (76) |
| Viral_life_cycle (65) | Cell_differentiation (220) | Physiological_process (57) |
| Cell (61) | Protein_phosphorylation (214) | Cell_differentiation (50) |
| Localization (60) | Physiological_process (154) | Cell (50) |
| Cell_communication (36) | Viral_life_cycle (141) | RNA (48) |
| RNA (31) | Cell_adhesion (86) | Cell_adhesion (43) |
| Protein_phosphorylation (26) | Protein_catabolism (84) | Cell_communication (40) |
| Cell_adhesion (18) | Biological_process (74) | Protein_catabolism (34) |
| Virus (17) | Cell_communication (71) | Protein_phosphorylation (29) |
| ... | ... | ... |
The 20 most frequent theme classes are shown for each of the Regulation, Positive_regulation and Negative_regulation event types. Note that events of Regulation type may be annotated with another event as their theme.
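Since a Regulation event may take another event as its theme, the annotation is naturally recursive. The following sketch models that with two small data classes; the class and field names are hypothetical and do not reproduce the GENIA XML schema. The example encodes sentence (A) of Figure 4, and the term classes assigned to "GATA-3 gene" and "silencer" are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Term:
    text: str
    term_class: str  # a GENIA term ontology class, e.g. "Protein" or "DNA"

@dataclass
class Event:
    event_class: str  # a GENIA event ontology class
    themes: List[Union["Term", "Event"]] = field(default_factory=list)
    causes: List[Union["Term", "Event"]] = field(default_factory=list)

def arg_class(arg):
    """Return the class to count for a theme or cause: a term class for
    terms, an event class for nested events."""
    return arg.term_class if isinstance(arg, Term) else arg.event_class

# "T-cell expression of the human GATA-3 gene is regulated by a
# non-lineage-specific silencer." (term classes below are assumed)
expression = Event("Gene_expression", themes=[Term("GATA-3 gene", "DNA")])
regulation = Event(
    "Regulation",
    themes=[expression],               # an event as theme
    causes=[Term("silencer", "DNA")],
)

print(arg_class(regulation.themes[0]), arg_class(regulation.causes[0]))
```

Counting `arg_class` over all themes of Regulation-type events would yield a mix of term classes and event classes, which is the kind of mixture seen in the table above.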
Distribution of cause classes for Regulation events
| Protein (5,797) |
| *No cause (4,184) |
| Other_organic_compound (2,398) |
| Positive_regulation (1,291) (See Table 6 for breakdown.) |
| DNA (1,045) |
| Lipid (713) |
| Negative_regulation (630) (See Table 6 for breakdown.) |
| Physiological_process (601) |
| Binding (577) |
| Artificial_process (448) |
| Mutagenesis (348) |
| Gene_expression (322) |
| ... |
The 12 most frequent cause classes for Regulation (without differentiating Positive_ or Negative_regulation events).
Breakdown of causes in Positive_ and Negative_regulation
| Positive_regulation | Negative_regulation |
|---|---|
| Protein (405) | Protein (200) |
| Gene_expression (218) | Positive_regulation (69) |
| DNA (46) | Gene_expression (27) |
| Protein_amino_acid_phosphorylation (25) | DNA (24) |
| Localization (23) | Localization (19) |
| RNA (18) | Protein_amino_acid_phosphorylation |
| ... | ... |
The Positive_regulation and Negative_regulation events which appear as causes in Table 5 are further classified in more detail according to their themes.
Semantic role types and their annotation instances
| clueLoc | clueTime | clueExperiment |
|---|---|---|
| nuclear (135) | early (15) | electrophoretic mobility shift assays (13) |
| in t cells (106) | subsequent (13) | northern blot analysis (9) |
| in human monocytes (57) | during t-cell activation (8) | electrophoretic mobility shift assay (7) |
| in monocytes (50) | within 30 min (6) | by electrophoretic mobility shift assays (7) |
| in b cells (41) | initial (6) | using electrophoretic mobility shift assays (4) |
| intracellular (38) | during aging (6) | in transient transfection assays (4) |
| in jurkat cells (34) | during monocytic differentiation (5) | in emsas (4) |
| in jurkat t cells (33) | simultaneous (4) | in electrophoretic mobility shift assays (4) |
| in monocytic cells (30) | during the immune response. (4) | site-directed mutagenesis (3) |
| in t lymphocytes (28) | during erythroid differentiation (4) | nuclear run-on experiments (3) |
| surface (27) | at 24 hr (4) | in gel mobility shift assays (3) |
| cells (24) | rapidly (3) | immunoblot analysis (3) |
| cytoplasmic (21) | for 6 hours (3) | gel-shift analysis (3) |
| in these cells (20) | for 6 h (3) | emsa (3) |
| in human t cells (20) | first (3) | cotransfection experiments (3) |
| t cells (18) | during the immune response (3) | by northern blot analysis (3) |
| cellular (18) | during the cell cycle (3) | by flow cytometry (3) |
| in activated t cells (17) | during t cell activation (3) | by electrophoretic mobility shift assay (3) |
| to the nucleus (16) | during myelopoiesis (3) | western blotting (2) |
| in the nucleus (16) | during monocyte differentiation (3) | transient transfection experiments (2) |
| in hela cells (16) | at 8 hr (3) | supershift analysis (2) |
| in thp-1 cells (15) | 24 h (3) | rt-pcr (2) |
| in b lymphocytes (15) | within 8 hr of infection (2) | northern blot analyses (2) |
| b cell (15) | within 6 h (2) | northern analysis (2) |
| in u937 cells (14) | within 4 hours (2) | mutational analysis (2) |
| in fibroblasts (13) | within 20 min (2) | mutational analyses (2) |
| b cells (13) | within 2 h (2) | mobility shift assays (2) |
| transendothelial (12) | second (2) | inhibition studies (2) |
| t-cell (12) | in a primary t cell response (2) | in transient transfection experiments (2) |
| into the nucleus (12) | from day 7 to day 14 of culture (2) | in transient assays (2) |
| ... | ... | ... |
Text expressions providing locational (clueLoc), temporal (clueTime) and experimental (clueExperiment) context where biological events take place. The 30 most frequently observed expressions for each type are listed.
Figure 8. Screenshot of XConc Suite. The XConc Suite consists of three plug-ins for the Eclipse platform: an XML editor (A), a concordancer (B is the query editor and C is the result view), and an ontology browser (D), which supports both the editor and the concordancer in the selection of ontology terms.