À Bravo, M Cases, N Queralt-Rosinach, F Sanz, L I Furlong.
Abstract
The biomedical literature represents a rich source of biomarker information. However, both the size of literature databases and their lack of standardization hamper the automatic exploitation of the information contained in these resources. Text mining approaches have proven to be useful for exploiting the information contained in scientific publications. Here, we show that a knowledge-driven text mining approach can exploit a large literature database to extract a dataset of biomarkers related to diseases covering all therapeutic areas. Our methodology takes advantage of the annotation of MEDLINE publications pertaining to biomarkers with MeSH terms, narrowing the search to specific publications and, therefore, minimizing the false positive ratio. It is based on a dictionary-based named entity recognition system and a relation extraction module. The application of this methodology resulted in the identification of 131,012 disease-biomarker associations between 2,803 genes and 2,751 diseases, and represents a valuable knowledge base for those interested in disease-related biomarkers. Additionally, we present a bibliometric analysis of the journals reporting biomarker-related information during the last 40 years.
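As a rough illustration of the two components named in the abstract, the sketch below pairs a dictionary lookup with sentence-level co-occurrence. It is not the authors' system: the function names, the toy dictionaries, and the example sentence are assumptions made for illustration, and co-occurrence stands in here for the relation extraction module. Only the gene identifiers and CUIs are taken from the example table in this record.

```python
import re
from itertools import product

# Toy dictionaries (hypothetical): lower-cased term -> concept identifier.
# The identifiers come from the example table; the term lists are illustrative.
GENE_TERMS = {
    "bdnf": "627",
    "brain-derived neurotrophic factor": "627",
    "ngal": "3934",
}
DISEASE_TERMS = {
    "autism": "C0004352",
    "acute kidney injury": "C0022660",
}


def find_concepts(sentence, dictionary):
    """Return the concept IDs whose terms occur in the sentence as whole words."""
    text = sentence.lower()
    found = set()
    for term, concept_id in dictionary.items():
        # Lookarounds prevent matches inside longer tokens.
        if re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", text):
            found.add(concept_id)
    return found


def cooccurrences(sentence):
    """Return every (disease CUI, gene ID) pair mentioned in the same sentence."""
    diseases = find_concepts(sentence, DISEASE_TERMS)
    genes = find_concepts(sentence, GENE_TERMS)
    return set(product(diseases, genes))


# Invented sentence, for illustration only.
pairs = cooccurrences("Serum BDNF levels were measured in children with autism.")
print(pairs)
```

Because several terms can map to one concept, synonyms such as "BDNF" and "brain-derived neurotrophic factor" collapse to the same gene identifier in the output.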
Year: 2014 PMID: 24839601 PMCID: PMC4009255 DOI: 10.1155/2014/253128
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1: Text mining workflow.
Figure 2: An example of the variability in terminology for genes depending on the primary sources.
Contents and statistics of gene, disease, and biomarker-gene and disease-biomarker dictionaries.
| Dictionary | Number of concepts | Number of terms | Ambiguity | Variability |
|---|---|---|---|---|
| Gene | 50,090 | 545,519 | 1.51 | 5.98 |
| Gene curated/extended | 50,090 | 649,414 | 1.46 | 12.96 |
| Disease | 79,781 | 378,616 | 1.01 | 4.14 |
| Disease curated/extended | 74,073 | 294,371 | 1.02 | 3.97 |
| Biomarker-specific gene | 3,533 | 89,236 | 1.27 | 25.26 |
| Biomarker-specific disease | 3,122 | 35,686 | 1.05 | 11.43 |
Examples of sentences including disease-biomarker cooccurrences.
| Disease (CUI)^a | Biomarker (Gene ID)^b | PMID | Sentence |
|---|---|---|---|
| Hodgkin's lymphoma (C0019829) | Anti-Mullerian hormone | 17726078 | |
| TCC (C1861305) | CK20 (54474) | 17397492 (2007) | |
| Autism (C0004352) | Brain-derived neurotrophic factor (BDNF) (627) | 19119429 | To investigate levels of |
| Acute kidney injury (AKI) (C0022660) | Neutrophil gelatinase-associated lipocalin (NGAL) (3934), netrin-1 (9423) | 21740336 | |
| Chronic heart failure (C0264716) | Cardiac troponin I (7137) | 21751783 | Top-down quantitative proteomics identified phosphorylation of |
| Adenocarcinomas (C0001418) | MOC-31 (4072) | 21732548 | |
| Lung adenocarcinoma (C0152013) | ROM (6094) | 21748260 | Hence, serum |
^a Concept unique identifier at UMLS.
^b NCBI gene identifier.
Figure 3: Number of publications (bars) and number of journals (line) by year.
Figure 4: The top 10 journals sorted by unique disease-biomarker cooccurrences identified.
The top 10 disease-biomarker associations. Disease-biomarker associations were ranked according to ScoreDB (see Section 2 for more details). The complete list of the associations is available at http://ibi.imim.es/biomarkers/.
| Score | Disease name (CUI^a) | Gene symbol (Gene ID^b) | Number of abstracts |
|---|---|---|---|
| 4076.42 | NEOPLASM (C0027651) | TP53 (7157) | 3,042 |
| 3930.25 | NEOPLASM (C0027651) | ERBB2 (2064) | 2,582 |
| 3441.32 | NEOPLASM (C0027651) | CEACAM5 (1048) | 2,234 |
| 2733.92 | IMMUNODEFICIENCY DISORDER (C0021051) | CD4 (920) | 1,548 |
| 2546.27 | NEOPLASM (C0027651) | EGFR (1956) | 1,710 |
| 2028.21 | LEUKEMIA (C0023418) | CD34 (947) | 1,071 |
| 1988.57 | NEOPLASM (C0027651) | ESR1 (2099) | 1,179 |
| 1943.15 | NEOPLASM (C0027651) | AFP (174) | 1,169 |
| 1915.15 | NEOPLASM (C0027651) | CD34 (947) | 1,108 |
| 1836.03 | MALIGNANT NEOPLASTIC DISEASE (C0006826) | KLK3 (354) | 936 |
^a Concept unique identifier at UMLS.
^b NCBI gene identifier.
Figure 5: Associations analysis. (a) Boxplot showing the score versus number of publications supporting each disease-biomarker association. (b) Distribution of associations based on the number of publications that support each association. The fraction of the associations that were reported in the last three years is highlighted as dark grey bars.
The top 10 genes sorted by the number of associated diseases. The complete list of genes is available at http://ibi.imim.es/biomarkers/.
| Gene symbol | Gene ID^a | Gene name | Number of diseases |
|---|---|---|---|
| IL6 | 3569 | Interleukin 6 (interferon, beta 2) | 1,025 |
| TNF | 7124 | Tumor necrosis factor | 1,003 |
| CD4 | 920 | CD4 molecule | 917 |
| ICAM1 | 3383 | Intercellular adhesion molecule 1 | 841 |
| TP53 | 7157 | Tumor protein p53 | 797 |
| CRP | 1401 | C-reactive protein, pentraxin-related | 786 |
| CD8A | 925 | CD8a molecule | 771 |
| CD34 | 947 | CD34 molecule | 742 |
| VEGFA | 7422 | Vascular endothelial growth factor A | 704 |
| ACE | 1636 | Angiotensin I converting enzyme | 666 |
^a NCBI gene identifier.
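A ranking like the one in the table above can be derived from a list of disease-biomarker pairs by counting distinct diseases per gene. The following is a minimal sketch under that assumption; the function name `diseases_per_gene` is hypothetical, and the sample pairs are a handful of associations taken from the top 10 association table, not the full dataset.

```python
from collections import defaultdict


def diseases_per_gene(associations):
    """Count distinct diseases per gene and sort by count (descending), then symbol."""
    per_gene = defaultdict(set)
    for disease_cui, gene_symbol in associations:
        # A set ignores repeated reports of the same pair.
        per_gene[gene_symbol].add(disease_cui)
    counts = [(gene, len(diseases)) for gene, diseases in per_gene.items()]
    return sorted(counts, key=lambda item: (-item[1], item[0]))


# A few (disease CUI, gene symbol) pairs from the top 10 association table,
# with one pair deliberately repeated.
sample = [
    ("C0027651", "TP53"),
    ("C0027651", "CD34"),
    ("C0023418", "CD34"),
    ("C0021051", "CD4"),
    ("C0027651", "TP53"),
]
print(diseases_per_gene(sample))
```

Swapping the two elements of each pair gives the complementary count, the number of associated genes per disease.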
Figure 6: Distribution of the number of associated biomarkers (for diseases, (a)) and diseases (for biomarkers, (b)). Gene symbols from HGNC are used for the biomarkers.
The top 10 diseases sorted by the number of associated biomarkers. The complete list of diseases is available at http://ibi.imim.es/biomarkers/.
| Disease name | CUI^a | Number of genes |
|---|---|---|
| Neoplasm | C0027651 | 2,033 |
| Malignant neoplastic disease | C0006826 | 1,750 |
| Carcinoma | C0007097 | 1,059 |
| Recurrent malignant neoplasm | C1458156 | 790 |
| Leukemia | C0023418 | 782 |
| Malignant melanoma | C0025202 | 755 |
| Liver cell carcinoma | C2239176 | 723 |
| Congenital deformity | C0000768 | 715 |
| Tumor angiogenesis | C1519670 | 633 |
| Tumor progression | C1519176 | 619 |
^a Concept unique identifier at UMLS.
Comparison of disease-biomarker pairs identified by the text mining (TM) approach with disease-biomarker annotations in DisGeNET, based on MeSH disease classification [31].
| MeSH code | MeSH disease class name | Number of disease-biomarker associations | Number validated with DisGeNET (%) |
|---|---|---|---|
| C01 | Bacterial infections and mycoses | 1,529 | 164 (10.73) |
| C02 | Virus diseases | 3,297 | 302 (9.16) |
| C03 | Parasitic diseases | 590 | 82 (13.90) |
| C05 | Musculoskeletal diseases | 5,771 | 388 (6.72) |
| C06 | Digestive system diseases | 8,154 | 1,156 (14.18) |
| C07 | Stomatognathic diseases | 2,531 | 195 (7.70) |
| C08 | Respiratory tract diseases | 5,460 | 735 (13.46) |
| C09 | Otorhinolaryngologic diseases | 770 | 40 (5.19) |
| C11 | Eye diseases | 2,513 | 226 (8.99) |
| C12 | Male urogenital diseases | 5,110 | 666 (13.03) |
| C13 | Female urogenital diseases and pregnancy complications | 6,432 | 863 (13.42) |
| C15 | Hemic and lymphatic diseases | 7,689 | 948 (12.33) |
| C17 | Skin and connective tissue diseases | 6,724 | 851 (12.66) |
| C18 | Nutritional and metabolic diseases | 6,314 | 711 (11.26) |
| C19 | Endocrine system diseases | 5,253 | 681 (12.96) |
| C21 | Disorders of environmental origin | 2 | 0 (0.00) |
| C23 | Pathological conditions, signs, and symptoms | 8,212 | 606 (7.38) |
| C24 | Occupational diseases | 72 | 11 (15.28) |
| F01 | Behavior and behavior mechanisms | 594 | 24 (4.04) |
| F03 | Mental disorders | 2,810 | 613 (21.89) |
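The percentage column in the table above is the validated count divided by the total number of text-mined associations in each class. The sketch below reproduces that arithmetic and shows one way the overlap with a reference set could be computed. The function names and the contents of `reference` are assumptions for illustration; the actual DisGeNET comparison may differ.

```python
def percent(validated, total):
    """Percentage of validated associations, rounded to two decimals."""
    return round(100.0 * validated / total, 2) if total else 0.0


def validation_rate(text_mined_pairs, reference_pairs):
    """Return (number validated, percentage) for pairs also found in a reference set."""
    validated = text_mined_pairs & reference_pairs
    return len(validated), percent(len(validated), len(text_mined_pairs))


# Counts taken from the table: C01 and C24.
print(percent(164, 1529))
print(percent(11, 72))

# Toy (disease CUI, gene ID) pairs; the reference set is hypothetical.
text_mined = {("C0004352", "627"), ("C0022660", "3934"), ("C0264716", "7137")}
reference = {("C0004352", "627")}
print(validation_rate(text_mined, reference))
```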