Hee-Jin Lee, Sang-Hyung Shim, Mi-Ryoung Song, Hyunju Lee, Jong C Park.
Abstract
BACKGROUND: Text mining (TM) systems are invaluable for accessing, both efficiently and accurately, the large amount of information in the biomedical literature about genes implicated in various cancers. Current TM systems target either gene-cancer relations or biological processes involving genes and cancers. The former produce information that is not comprehensive enough to explain how a gene affects a cancer, and the latter do not provide a concise summary of gene-cancer relations.
Year: 2013 PMID: 24225062 PMCID: PMC3833657 DOI: 10.1186/1471-2105-14-323
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Event annotations in the CG corpus. We used the brat rapid annotation tool [27] to visualize the event annotations.
Annotation concept values and their definitions
| Concept | Value | Definition |
| CGE | increased | Expression level of the gene is increased. |
| CGE | decreased | Expression level of the gene is decreased. |
| CCS | normal->normal | The cell or tissue remains as normal after the change in the gene’s expression level. |
| CCS | normal->cancer | The cell or tissue acquires cancerous properties as the gene expression level changes; some cancerous properties of the cell or tissue are strengthened as the gene expression level changes. |
| CCS | cancer->cancer | There’s no change in the cancerous properties of the cell or tissue despite the change in the expression level of the gene. |
| CCS | cancer->normal | The cell or tissue loses some cancerous properties as the gene expression level changes; some cancerous properties of the cell or tissue are weakened as the gene expression level changes. |
| CCS | unidentifiable | The information about whether or not the gene expression level change accompanies cell or tissue state change is not provided. |
| IGE | up-regulated | The initial gene expression level is higher than the expression level of the gene in the normal state. |
| IGE | down-regulated | The initial gene expression level is lower than the expression level of the gene in the normal state. |
| IGE | unchanged | The initial gene expression level is comparable to the expression level of the gene in the normal state. |
| IGE | unidentifiable | The information about the initial gene expression level is not provided. |
| PT | observation | Cell or tissue change accompanied by the gene expression level change is reported as observed but the causality between the two is not claimed. |
| PT | causality | The causality between the gene expression level change and the cell or tissue change is claimed. |
In the definitions, the ‘normal’ state of cells or tissues refers to the state in which the cells or tissues show no cancerous properties.
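The value sets in this table can be written down as a small lookup for checking annotations. The following is a minimal sketch, not the authors' tooling: the names `ALLOWED_VALUES` and `validate_annotation` are hypothetical, the arrows are written as `->`, and the sketch assumes that an annotation unit is a dictionary that may omit some concepts.

```python
# Allowed values per annotation concept, copied from the table above.
# The dictionary layout and all names here are illustrative assumptions.
ALLOWED_VALUES = {
    "CGE": {"increased", "decreased"},
    "CCS": {
        "normal->normal",
        "normal->cancer",
        "cancer->cancer",
        "cancer->normal",
        "unidentifiable",
    },
    "IGE": {"up-regulated", "down-regulated", "unchanged", "unidentifiable"},
    "PT": {"observation", "causality"},
}


def validate_annotation(annotation):
    """Return a list of problems for one annotation unit (empty if none).

    Only the concepts present in the dictionary are checked; concepts
    that are absent are not reported as errors.
    """
    problems = []
    for concept, value in annotation.items():
        if concept not in ALLOWED_VALUES:
            problems.append(f"unknown concept: {concept}")
        elif value not in ALLOWED_VALUES[concept]:
            problems.append(f"invalid {concept} value: {value!r}")
    return problems


if __name__ == "__main__":
    unit = {"CGE": "increased", "CCS": "normal->cancer", "PT": "observation"}
    print(validate_annotation(unit))  # no problems expected for this unit
```

A check of this kind would catch value typos before adjudication, but it says nothing about whether the chosen values are correct for the sentence.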
Example annotations with inferred gene classes
| Example sentence | CGE | CCS | IGE | PT | Inferred gene class |
| [ | inc. | n->c | unc. | obs. | Biomarker (by rule 7) |
| [ | dec. | n->c | uni. | cau. | Tumor suppressor gene (by rule 4) |
| For example, some studies showed that CLU expression is increased in advanced stages of prostate cancer and that [ | dec. | c->n | up. | cau. | Oncogene (by rule 3) |
In the example sentences, gene names, cancer-related terms and the keywords for gene expression change are noted in square brackets and marked with subscripts ‘g’, ‘c’ and ‘e’, respectively. The annotation concept values are abbreviated for brevity.
Inference rules for gene classification
| Rule | CGE | CCS | IGE | PT | Inferred gene class |
| 1 | increased | normal->cancer | * | causality | oncogene |
| 2 | decreased | cancer->normal | unidentifiable | causality | oncogene |
| 3 | decreased | cancer->normal | up-regulated | * | oncogene |
| 4 | decreased | normal->cancer | * | causality | tumor suppressor gene |
| 5 | increased | cancer->normal | unidentifiable | causality | tumor suppressor gene |
| 6 | increased | cancer->normal | down-regulated | * | tumor suppressor gene |
| 7 | * | normal->cancer | * | observation | biomarker |
| 8 | * | cancer->normal | unidentifiable | observation | biomarker |
| 9 | decreased | cancer->cancer | up-regulated | observation | biomarker |
| 10 | increased | cancer->cancer | down-regulated | observation | biomarker |
The rules are not symmetric. For instance, Rule 4 states that a gene can be classified as a ‘tumor suppressor gene’ when a decreased expression level of the gene causes cancer progression, regardless of the IGE value. Rule 2 covers a similar case, where a decreased expression level of a gene causes cancer regression, but it requires IGE to be ‘unidentifiable’ to infer that the gene is an ‘oncogene’. Rule 3, not Rule 2, covers the case where IGE is ‘up-regulated’ and PT is ‘causality’. Rule 2 also does not cover the cases where IGE is ‘down-regulated’ or ‘unchanged’, since we expect such cases to be rarely reported in biology articles (cf. Section on Inference rule development). Rules 5 and 8 are designed in a similar way. The asterisk denotes all the pre-defined values for the corresponding concept.
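The ten rules can be read as a pattern-matching lookup over the four concept values, in the order CGE, CCS, IGE, PT. The sketch below is one possible encoding, not the authors' implementation: the names `RULES` and `infer_gene_class` are hypothetical, arrows are written as `->`, and rules are tried in table order (the patterns do not overlap, so the order does not change the result).

```python
# Each rule: (number, CGE, CCS, IGE, PT, inferred gene class).
# "*" matches any pre-defined value of the corresponding concept.
RULES = [
    (1, "increased", "normal->cancer", "*", "causality", "oncogene"),
    (2, "decreased", "cancer->normal", "unidentifiable", "causality", "oncogene"),
    (3, "decreased", "cancer->normal", "up-regulated", "*", "oncogene"),
    (4, "decreased", "normal->cancer", "*", "causality", "tumor suppressor gene"),
    (5, "increased", "cancer->normal", "unidentifiable", "causality",
     "tumor suppressor gene"),
    (6, "increased", "cancer->normal", "down-regulated", "*",
     "tumor suppressor gene"),
    (7, "*", "normal->cancer", "*", "observation", "biomarker"),
    (8, "*", "cancer->normal", "unidentifiable", "observation", "biomarker"),
    (9, "decreased", "cancer->cancer", "up-regulated", "observation", "biomarker"),
    (10, "increased", "cancer->cancer", "down-regulated", "observation",
     "biomarker"),
]


def infer_gene_class(cge, ccs, ige, pt):
    """Return (rule number, gene class) for the first matching rule, else None."""
    values = (cge, ccs, ige, pt)
    for number, *pattern, gene_class in RULES:
        if all(p == "*" or p == v for p, v in zip(pattern, values)):
            return number, gene_class
    return None  # corresponds to "No rule applied"


if __name__ == "__main__":
    # Decreased expression, cancer->normal, initially up-regulated, causal claim.
    print(infer_gene_class("decreased", "cancer->normal", "up-regulated", "causality"))
```

The asymmetry described above shows up directly: a decreased, causal, cancer->normal annotation with IGE ‘down-regulated’ or ‘unchanged’ matches no rule and returns `None`.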
The sizes of corpora about genes and diseases
| CoMAGC | 408 | 26177 | 821 sets of four annotation concepts |
| Craven | 1677 | 333845 | 829 gene-disease pairs |
| PolySearch | 522 | 116380 | 341 gene-disease pairs |
| GETM | 150 | 38355 | 267 gene expression-anatomical location pairs |
| MLEE | 262 | 56588 | 6677 events |
| ID | 30 | 153153 | 4150 events |
| CG | 600 | 129878 | 17248 events |
All the corpora contain PubMed abstracts, except for the ID corpus which contains full text documents. For the Craven and PolySearch corpora, we show the number of positive gene-disease pairs only.
Distribution of the annotation concept values after adjudication
| Cancer type | CGE: inc. | CGE: dec. | CCS: n->c | CCS: c->n | CCS: uni. | IGE: up. | IGE: unc. | IGE: uni. | PT: obs. | PT: cau. |
| Prostate | 206(66%) | 104(34%) | 122(39%) | 62(20%) | 126(41%) | 1(1%) | 63(34%) | 120(65%) | 115(63%) | 69(38%) |
| Breast | 177(69%) | 78(31%) | 121(47%) | 34(13%) | 100(39%) | 1(1%) | 58(37%) | 96(62%) | 101(65%) | 54(35%) |
| Ovarian | 184(72%) | 72(28%) | 154(60%) | 25(10%) | 77(30%) | 0(0%) | 91(51%) | 88(49%) | 138(77%) | 41(23%) |
| Total | 567(69%) | 254(31%) | 397(48%) | 121(15%) | 303(37%) | 2(0%) | 212(41%) | 304(59%) | 354(68%) | 164(32%) |
The number of annotation units, to which each annotation concept value is assigned, is shown. The annotation concept values are abbreviated for brevity.
Correlation among annotation concepts
| CGE \ CCS | n->c | c->n | un. |
| inc. | 318 (2.65) | 45 (-4.22) | 204 (-0.36) |
| dec. | 79 (-3.95) | 76 (6.30) | 99 (0.54) |
| IGE \ CCS | n->c | c->n |
| up. | 0 (-1.24) | 2 (2.24) |
| unc. | 212 (3.89) | 0 (-7.04) |
| uni. | 185 (-3.14) | 119 (5.69) |
| PT \ CCS | n->c | c->n |
| ob. | 311 (2.41) | 43 (-4.36) |
| cau. | 86 (-3.54) | 78 (6.41) |
| PT \ IGE | up. | unc. | uni. |
| ob. | 0 (-1.17) | 191 (3.83) | 163 (-3.10) |
| cau. | 2 (1.72) | 21 (-5.63) | 141 (4.56) |
Pairwise contingency matrices of the annotation concepts are shown. The correlation analyses were performed with the χ² test. Numbers in parentheses are Pearson residuals. The annotation concept values are abbreviated for brevity.
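As an illustration of how the parenthesized values relate to the counts, the sketch below computes Pearson residuals, (observed - expected) / sqrt(expected), for the 2×3 matrix of increased/decreased expression against the three cell-state columns. It assumes this standard definition, with expected counts taken from the row and column totals; the function name `pearson_residuals` is hypothetical, and the source does not state which software was used.

```python
import math


def pearson_residuals(table):
    """Pearson residuals for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            res_row.append((observed - expected) / math.sqrt(expected))
        residuals.append(res_row)
    return residuals


# Rows: inc., dec.; columns: n->c, c->n, un. (counts copied from the table).
cge_by_ccs = [[318, 45, 204], [79, 76, 99]]
residuals = pearson_residuals(cge_by_ccs)

# The chi-square statistic is the sum of squared Pearson residuals.
chi_square = sum(r * r for row in residuals for r in row)

if __name__ == "__main__":
    for row in residuals:
        print([round(r, 2) for r in row])
    print(round(chi_square, 1))
```

Rounded to two decimals, these residuals agree with the values printed next to the counts for this matrix.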
Distribution of applied inference rules
| Rule | Prostate | Breast | Ovarian | Total |
| 1 | 21(7%) | 26(10%) | 22(9%) | 69(8%) |
| 2 | 21(7%) | 13(5%) | 9(4%) | 43(5%) |
| 3 | 1(0%) | 1(0%) | 0(0%) | 2(0%) |
| 4 | 5(2%) | 6(2%) | 6(2%) | 17(2%) |
| 5 | 21(7%) | 8(3%) | 4(2%) | 33(4%) |
| 6 | 0(0%) | 0(0%) | 0(0%) | 0(0%) |
| 7 | 96(31%) | 89(35%) | 126(49%) | 311(38%) |
| 8 | 19(6%) | 12(5%) | 12(5%) | 43(5%) |
| 9 | 0(0%) | 0(0%) | 0(0%) | 0(0%) |
| 10 | 0(0%) | 0(0%) | 0(0%) | 0(0%) |
| No rule applied | 126(41%) | 100(39%) | 77(30%) | 303(37%) |
| Total | 310(100%) | 255(100%) | 256(100%) | 821(100%) |
The number of annotation units to which each inference rule is applied is shown.
Oncogenes and tumor suppressor genes in CoMAGC
| Cancer type | Oncogene | Tumor suppressor gene |
| Prostate cancer | FGF6 | EAF2 |
| Breast cancer | TGFB1 | RBP1 |
| Ovarian cancer | PLAU | BRCA1 |
Genes marked with superscripts u, c, t and v are validated with UniprotKB [31], Cancer Genes database [35], TSGene [33] and the cancer gene list by Vogelstein and colleagues [34], respectively.
The four main annotation phases
| Phase | Annotation units | Cancer type (units) | Source |
| Pilot | 43 | Prostate (43) | MEDLINE abstracts linked to DDPC |
| Phase 1 | 237 | Prostate (237) | MEDLINE abstracts |
| Phase 2 | 451 | Breast (225), ovarian (226) | MEDLINE abstracts |
| Phase 3 | 90 | Prostate (30), breast (30), ovarian (30) | MEDLINE abstracts |
| Total | 821 | Prostate (310), breast (255), ovarian (256) | MEDLINE abstracts |
IAA values
| Pilot | 1.00 | 1.00 | 1.00 | 0.73 | 0.34 | 0.66 | 1.00 | 1.00 | 1.00 | 0.83 | 0.63 | 0.67 |
| Phase 1 | 1.00 | 0.99 | 0.99 | 0.81 | 0.71 | 0.76 | 0.95 | 0.48 | 0.93 | 0.90 | 0.64 | 0.81 |
| Phase 2 - breast | 0.99 | 0.98 | 0.98 | 0.75 | 0.60 | 0.69 | 0.96 | 0.00 | 0.95 | 0.84 | 0.65 | 0.69 |
| Phase 2 - ovarian | 1.00 | 0.99 | 0.99 | 0.80 | 0.63 | 0.75 | 1.00 | N/A | 1.00 | 0.88 | 0.62 | 0.75 |
| Phase 2 | 0.99 | 0.98 | 0.99 | 0.78 | 0.62 | 0.72 | 0.98 | 0.00 | 0.97 | 0.86 | 0.64 | 0.72 |
| Phase 3 - prostate | 1.00 | 1.00 | 1.00 | 0.83 | 0.73 | 0.79 | 1.00 | N/A | 1.00 | 0.67 | 0.18 | 0.33 |
| Phase 3 - breast | 1.00 | 1.00 | 1.00 | 0.80 | 0.62 | 0.75 | 1.00 | N/A | 1.00 | 1.00 | 1.00 | 1.00 |
| Phase 3 - ovarian | 1.00 | 1.00 | 1.00 | 0.90 | 0.82 | 0.88 | 1.00 | N/A | 1.00 | 0.93 | 0.63 | 0.87 |
| Overall | 0.99 | 0.99 | 0.99 | 0.79 | 0.64 | 0.73 | 0.98 | 0.00 | 0.97 | 0.86 | 0.64 | 0.73 |
Inference rule validation results
| Rule | Full match | One match | No match | Total | Agreement rate |
| 1 | 4 | 0 | 1 | 5 | 0.8 |
| 2 | 3 | 0 | 0 | 3 | 1 |
| 3 | 1 | 0 | 1 | 2 | 0.5 |
| 4 | 2 | 0 | 0 | 2 | 1 |
| 5 | 0 | 0 | 0 | 0 | n/a |
| 6 | 0 | 0 | 0 | 0 | n/a |
| 7 | 28 | 1 | 1 | 30 | 0.97 |
| 8 | 6 | 2 | 0 | 8 | 1 |
| 9 | 0 | 0 | 0 | 0 | n/a |
| 10 | 0 | 0 | 0 | 0 | n/a |
| No rule applied | 40 | 0 | 2 | 42 | 0.95 |
| Total | 84 | 3 | 5 | 92 | 0.95 (micro), 0.89 (macro) |
A gene class inferred by an inference rule is compared with the gene classes assigned independently by the two annotators. When all three gene classes are the same, we refer to the case as a full match. When the inferred gene class agrees with only one of the two annotated gene classes, we refer to the case as a one match. When the inferred gene class differs from both annotated gene classes, it is a no match. The agreement rate is the proportion of full-match and one-match cases among all cases.
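The agreement rates can be recomputed from the per-rule counts. The sketch below assumes that ‘micro’ means pooling the counts over all rows before dividing, and that ‘macro’ means averaging the per-row rates over rows with at least one case; the source does not spell out these definitions, and the names used here are hypothetical.

```python
# (full match, one match, no match) per row, copied from the table above.
validation_counts = {
    "1": (4, 0, 1),
    "2": (3, 0, 0),
    "3": (1, 0, 1),
    "4": (2, 0, 0),
    "5": (0, 0, 0),
    "6": (0, 0, 0),
    "7": (28, 1, 1),
    "8": (6, 2, 0),
    "9": (0, 0, 0),
    "10": (0, 0, 0),
    "No rule applied": (40, 0, 2),
}


def agreement_rate(full, one, none):
    """Share of full-match and one-match cases; None when there are no cases."""
    total = full + one + none
    return (full + one) / total if total else None


rates = {rule: agreement_rate(*counts) for rule, counts in validation_counts.items()}

# Micro average (assumed): pool all cases, then compute one rate.
agreed = sum(full + one for full, one, _ in validation_counts.values())
cases = sum(sum(counts) for counts in validation_counts.values())
micro = agreed / cases

# Macro average (assumed): mean of per-row rates, skipping rows without cases.
defined = [rate for rate in rates.values() if rate is not None]
macro = sum(defined) / len(defined)

if __name__ == "__main__":
    print(round(micro, 2), round(macro, 2))
```

Under these assumptions the two averages round to the reported 0.95 and 0.89, which suggests that rows marked n/a were left out of the macro average.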
Instructions on the allowed or disallowed inference types during annotation
| No. | Instruction |
| 1 | Annotators can interpret the sentences and annotate concepts in a ‘conventional way’, in which the sentences would usually be interpreted by human readers. |
| 2 | Annotators can infer information using their prior knowledge about properties of cancer cells when the sentence is about comparison of two different cancer cells of the same cancer type. |
| 3 | Annotators can infer information utilizing linguistic clues. |
| 4 | Annotators should not infer information using their prior knowledge about the functions of genes. |
| 5 | Annotators should not infer the CCS value from the information about patients’ survival rates because progression of cancer cells is not the sole factor that contributes to patient survival or death. |
| 6 | Annotators need not consider the certainty level of propositions. |
See Additional file 1 for more details on the instructions.