Samir Gupta, Hayley Dingerdissen, Karen E Ross, Yu Hu, Cathy H Wu, Raja Mazumder, K Vijay-Shanker.
Abstract
Gene expression levels affect biological processes and play a key role in many diseases. Characterizing expression profiles is useful for clinical research and for the diagnosis and prognosis of diseases. Several high-quality databases currently capture gene expression information in the context of disease, obtained mostly from large-scale studies that use technologies such as microarrays and next-generation sequencing. The scientific literature is another rich source of information on gene expression-disease relationships, which have been captured not only from large-scale studies but also observed in thousands of small-scale studies. Expression information obtained from the literature through manual curation can extend expression databases. While many existing databases include information from the literature, they are limited by the time-consuming nature of manual curation and have difficulty keeping up with the explosion of publications in the biomedical field. In this work, we describe an automated text-mining tool, Disease-Expression Relation Extraction from Text (DEXTER), that extracts information from the literature on gene and microRNA expression in the context of disease. One of the motivations in developing DEXTER was to extend the BioXpress database, a cancer-focused gene expression database that includes data derived from large-scale experiments and manual curation of publications. The literature-based portion of BioXpress lags significantly behind the expression information obtained from large-scale studies and can benefit from our text-mined results. We conducted two different evaluations to measure the accuracy of our text-mining tool and achieved average F-scores of 88.51% and 81.81%, respectively.
Also, to demonstrate the ability to extract rich expression information in different disease-related scenarios, we used DEXTER to extract differential expression information for 2024 genes in lung cancer, 115 glycosyltransferases in 62 cancers and 826 microRNAs in 171 cancers. All extractions using DEXTER are integrated in the literature-based portion of BioXpress.
Database URL: http://biotm.cis.udel.edu/DEXTER
Year: 2018 PMID: 29860481 PMCID: PMC6007211 DOI: 10.1093/database/bay045
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. System pipeline overview.
Figure 2. Example SDG.
Components extracted from Type A sentences (Example 6)
| Sentence # | Scale indicator (SI) | Compared aspect (CA) | Compared entity 1 (CE1) | Compared entity 2 (CE2) |
|---|---|---|---|---|
| 1 | Higher | Plasma miR-187 | OSCC patients | Normal individuals |
| 2 | Lower | miR-181a expression | HepG2 cells | Hep3B cells |
| 3 | Higher | miR-210 expression | Metastatic tumors | Primary tumors |
| 4 | Increased | miR-95 levels | Human prostate cancer specimens | Normal tissues |
| 5 | Increased | TP expression | Ovarian cancers | Normal ovaries |
| 6 | Decreased | FOXD3 expression | HGG tissues | Normal brain tissues |
| 7 | Lower | Expression level of PTEN mRNA | Patients with CLL | Controls |
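
The four components in the table above map naturally onto a small record type. The following is a minimal sketch, not DEXTER's actual data model: the class name `TypeARecord`, its field names, and the `DIRECTION` lookup are all hypothetical, and the lookup covers only the four indicator words that appear in the table.

```python
from dataclasses import dataclass

# Hypothetical mapping from scale-indicator words to a direction label.
# Only the indicator words that occur in the table are covered.
DIRECTION = {
    "higher": "up",
    "increased": "up",
    "lower": "down",
    "decreased": "down",
}


@dataclass(frozen=True)
class TypeARecord:
    """One row of the Type A component table (illustrative field names)."""
    scale_indicator: str   # SI
    compared_aspect: str   # CA
    entity_1: str          # CE1
    entity_2: str          # CE2

    def direction(self) -> str:
        """Direction of the compared aspect in entity_1 relative to entity_2."""
        return DIRECTION[self.scale_indicator.lower()]


# Two rows copied from the table (sentences 1 and 6).
records = [
    TypeARecord("Higher", "Plasma miR-187", "OSCC patients", "Normal individuals"),
    TypeARecord("Decreased", "FOXD3 expression", "HGG tissues", "Normal brain tissues"),
]

for r in records:
    print(f"{r.compared_aspect}: {r.direction()} in {r.entity_1} vs {r.entity_2}")
```

Keeping the raw indicator word alongside a derived direction lets a downstream consumer filter on up- or down-regulation without losing the original wording.
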
Figure 3. Comparison SDG example.
Components extracted from Type B sentences (Example 7)
| Sentence # | Level indicator (LI) | Expressed aspect (EA) | Expression location (EL) | Implicit comparison |
|---|---|---|---|---|
| 1 | Over-expressed | GALNT2 | OSCC | Yes |
| 2 | Higher | IGF1R expression levels | Right adrenocortical tumor | Yes |
| 3 | Low | Levels of miR-373 expression | Pancreatic cancer cell lines | No |
| 4 | Higher | Higher level of BRF2 expression | NSCLC tissues | Yes |
| 5 | High | TRIM32 expression levels | Gastric cancer tissues | No |
| 6 | High | Expression levels of FKBP51 | Melanoma cells | No |
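
In the table above, the "Implicit comparison" column lines up with the form of the level indicator: comparative or prefixed words ("Higher", "Over-expressed") are marked Yes, while plain adjectives ("High", "Low") are marked No. The sketch below encodes that observation as a simple heuristic and checks it against the six rows. The rule itself is an assumption made for illustration; the source does not state how DEXTER decides this, and the function name `implies_comparison` is hypothetical.

```python
from typing import NamedTuple


class TypeBRow(NamedTuple):
    """One row of the Type B component table (illustrative field names)."""
    level_indicator: str      # LI
    expressed_aspect: str     # EA
    expression_location: str  # EL
    implicit_comparison: bool # value listed in the table


def implies_comparison(level_indicator: str) -> bool:
    """Assumed heuristic: comparative forms and over-/under- prefixed words
    imply a baseline; plain 'high'/'low' do not."""
    word = level_indicator.lower()
    return word.endswith("er") or word.startswith(("over", "under"))


# All six rows copied from the table.
rows = [
    TypeBRow("Over-expressed", "GALNT2", "OSCC", True),
    TypeBRow("Higher", "IGF1R expression levels", "Right adrenocortical tumor", True),
    TypeBRow("Low", "Levels of miR-373 expression", "Pancreatic cancer cell lines", False),
    TypeBRow("Higher", "Higher level of BRF2 expression", "NSCLC tissues", True),
    TypeBRow("High", "TRIM32 expression levels", "Gastric cancer tissues", False),
    TypeBRow("High", "Expression levels of FKBP51", "Melanoma cells", False),
]

# Rows where the heuristic disagrees with the table's listed value.
mismatches = [r for r in rows if implies_comparison(r.level_indicator) != r.implicit_comparison]
print(f"{len(rows) - len(mismatches)} of {len(rows)} rows agree with the heuristic")
```

A surface rule like this would likely misfire on words outside the table (for example, any unrelated word ending in "er"), so it should be read only as a way to make the column's meaning concrete.
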
Figure 4. (a) Type B SDG Example 1. (b) Type B SDG Example 2.
Large-scale processing results
| Data set | # abstracts processed | # abstracts extracted (Type A) | # abstracts extracted (Type B) | # entries (Type A) | # entries (Type B) | # expressed genes (Type A) | # expressed genes (Type B) |
|---|---|---|---|---|---|---|---|
| Lung cancer set | 88 431 | 742 | 1448 | 985 | 2019 | 642 | 1383 |
| Glycosyltransferases set | 27 516 | 90 | 180 | 106 | 236 | 42 | 73 |
| microRNA set | 28 067 | 1650 | 3575 | 2522 | 6437 | 477 | 721 |
Figure 5. Top 10 genes whose expression is associated with lung cancer types in the literature.
Figure 6. Top 10 GTs whose expression is associated with cancer types in the literature.
Figure 7. Top 10 microRNAs whose expression is associated with cancer types in the literature.
Figure 8. DEXTER web interface’s search results for the query ‘egfr AND lung cancer’.
BioXpress-based evaluation results
| True positive | False positive | False negative | Precision (%) | Recall (%) | F-score (%) |
|---|---|---|---|---|---|
| 77 | 5 | 15 | 93.90 | 83.69 | 88.51 |
Second evaluation results
| True positive | False positive | False negative | Precision (%) | Recall (%) | F-score (%) |
|---|---|---|---|---|---|
| 126 | 13 | 43 | 90.06 | 74.56 | 81.81 |
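
The precision, recall and F-score columns in the two evaluation tables can be recomputed from the true positive, false positive and false negative counts. The sketch below uses the standard definitions and assumes the reported F-score is the balanced F1 measure; the function name `prf` is made up for this example.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return precision, recall and F1 as percentages from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return 100 * precision, 100 * recall, 100 * f1


# Counts copied from the two evaluation tables.
evaluations = {
    "BioXpress-based evaluation": (77, 5, 15),
    "Second evaluation": (126, 13, 43),
}

for name, (tp, fp, fn) in evaluations.items():
    p, r, f = prf(tp, fp, fn)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

For the first evaluation the recomputed values agree with the table up to rounding in the last digit. For the second, recall and F-score also agree up to rounding, but the counts give a precision of about 90.65 rather than the listed 90.06.
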