Arzucan Ozgür, Thuy Vu, Günes Erkan, Dragomir R Radev.
Abstract
MOTIVATION: Understanding the role of genetics in diseases is one of the most important aims of the biological sciences. The completion of the Human Genome Project has led to a rapid increase in the number of publications in this area. However, the coverage of curated databases that provide information manually extracted from the literature is limited. Another challenge is that determining disease-related genes requires laborious experiments. Therefore, predicting good candidate genes before experimental analysis will save time and effort. We introduce an automatic approach based on text mining and network analysis to predict gene-disease associations. We collected an initial set of known disease-related genes and built an interaction network by automatic literature mining based on dependency parsing and support vector machines. Our hypothesis is that the central genes in this disease-specific network are likely to be related to the disease. We used the degree, eigenvector, betweenness and closeness centrality metrics to rank the genes in the network.
Year: 2008 PMID: 18586725 PMCID: PMC2718658 DOI: 10.1093/bioinformatics/btn182
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
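The ranking step described in the abstract can be illustrated with a minimal sketch. It covers only two of the four metrics (degree and closeness) in plain Python, and it is not the authors' implementation. The edge list is a hypothetical toy network, the helper names are invented for illustration, and the closeness formula (reachable nodes divided by summed shortest-path distance) is one common definition that the paper may or may not have used.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (gene, gene) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly linked to."""
    n = len(graph)
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}


def closeness_centrality(graph):
    """Reachable-node count divided by summed BFS distance (assumed definition)."""
    scores = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total = sum(dist.values())
        scores[source] = (len(dist) - 1) / total if total else 0.0
    return scores


def rank(scores):
    """Sort genes by descending score, breaking ties alphabetically."""
    return sorted(scores, key=lambda gene: (-scores[gene], gene))


# Hypothetical toy interactions, not taken from the paper's network.
edges = [
    ("AR", "TP53"),
    ("TP53", "BRCA2"),
    ("TP53", "PTEN"),
    ("PTEN", "AKT1"),
    ("AKT1", "AR"),
]
graph = build_graph(edges)
degree_rank = rank(degree_centrality(graph))
closeness_rank = rank(closeness_centrality(graph))
print(degree_rank)
print(closeness_rank)
```

In this toy network TP53 ranks first under both metrics even though it is not among the seed genes, which is the kind of inferred candidate the approach is designed to surface.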
The prostate cancer seed genes retrieved from OMIM Morbid Map
| Gene | Description |
|---|---|
| AR | Androgen Receptor |
| BRCA2 | Breast cancer 2, early onset |
| MSR1 | Macrophage scavenger receptor 1 |
| EPHB2 | EPH receptor B2 |
| KLF6 | Kruppel-like factor 6 |
| MAD1L1 | MAD1 mitotic arrest deficient-like 1 (yeast) |
| HIP1 | Huntingtin interacting protein 1 |
| CD82 | CD82 molecule |
| ELAC2 | ElaC homolog 2 (E. coli) |
| MXI1 | MAX interactor 1 |
| PTEN | Phosphatase and tensin homolog |
| RNASEL | Ribonuclease L (2′, 5′-oligoisoadenylate synthetase-dependent) |
| HPC1 | Hereditary prostate cancer 1 |
| CHEK2 | CHK2 checkpoint homolog (S. pombe) |
| PCAP | Predisposing for prostate cancer |
Fig. 1. The dependency tree of the sentence ‘These results suggest KCC3 is a new member of the KCC family that is under distinct regulation from KCC1’.
Training data sets
| Data set | Total sentences | Positive sentences | Negative sentences |
|---|---|---|---|
| AIMED | 4026 | 951 | 3075 |
| CB | 4056 | 2202 | 1854 |
Percentage of top n genes associated with prostate cancer based on the PGDB
| Top n | Degree | Eigenvector | Betweenness | Closeness | Baseline |
|---|---|---|---|---|---|
| 10 | 80.00 | 80.00 | 90.00 | 70.00 | 50.00 |
| 20 | 75.00 | 80.00 | 70.00 | 55.00 | 45.00 |
| 30 | 60.00 | 63.33 | 63.33 | 56.67 | 43.33 |
| 40 | 55.00 | 57.50 | 52.50 | 47.50 | 32.50 |
| 50 | 46.00 | 50.00 | 48.00 | 42.00 | 28.00 |
| 75 | 33.33 | 36.00 | 34.67 | 33.33 | 34.67 |
| 100 | 26.00 | 28.00 | 26.00 | 27.00 | 27.00 |
| 125 | 23.20 | 25.60 | 23.20 | 23.30 | 22.40 |
| 150 | 20.67 | 22.00 | 20.00 | 20.00 | 18.67 |
| 175 | 18.29 | 20.57 | 18.29 | 18.29 | 17.14 |
| 200 | 17.50 | 19.00 | 18.50 | 17.00 | 15.00 |
| 226 | 17.70 | 17.70 | 17.70 | 17.70 | 13.27 |
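Each cell in the table above is the share of the top n ranked genes that appear in the reference database. A minimal sketch of that calculation follows. The function name, the five-gene ranking and the reference set are hypothetical and serve only to show the arithmetic.

```python
def percent_in_reference(ranked_genes, reference_genes, n):
    """Percentage of the top-n ranked genes that occur in the reference set."""
    if n <= 0:
        raise ValueError("n must be positive")
    top = ranked_genes[:n]
    hits = sum(1 for gene in top if gene in reference_genes)
    return 100.0 * hits / len(top)


# Hypothetical ranking and reference set, for illustration only.
ranked = ["TP53", "BRCA1", "EREG", "AKT1", "MAPK1"]
reference = {"TP53", "BRCA1", "AKT1"}

top2 = percent_in_reference(ranked, reference, 2)
top5 = percent_in_reference(ranked, reference, 5)
print(top2, top5)
```

Because the denominator grows with n while the reference set is fixed, the percentage tends to fall as n increases, which matches the downward trend in every column of the table.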
Genes inferred by degree, eigenvector, closeness and betweenness centralities
| Gene | Degree | Eigenvector | Closeness | Betweenness | Evidence | Reference |
|---|---|---|---|---|---|---|
| TP53 | + | + | + | + | PGDB | |
| BRCA1 | + | + | + | + | PGDB | |
| EREG | + | + | + | + | None | |
| AKT1 | + | + | + | + | PGDB | |
| MAPK1 | + | + | + | + | Literature | (Hao |
| TNF | + | + | + | + | PGDB | |
| CCND1 | + | + | + | + | PGDB | |
| MYC | + | + | + | + | PGDB | |
| APC | + | + | − | − | PGDB | |
| CDKN1B | + | + | + | − | PGDB | |
| MAPK8 | + | + | + | + | PGDB | |
| NR3C1 | − | + | + | − | Literature | (Wei |
| VEGFA | + | + | + | − | PGDB | |
| MDM2 | + | + | + | − | KEGG and Literature | (Wang |
| POLD1 | − | − | + | + | None | |
| SNORA62 | − | − | + | + | None | |
| CNTN2 | − | − | − | + | None | |
| PPA1 | − | − | − | + | None | |
| TMEM37 | − | − | + | − | None | |
| FZR1 | − | − | + | − | PGDB | |
| SSSCA1 | − | − | + | − | None | |
| BCL2 | + | − | − | − | PGDB | |
| INS | + | − | − | − | KEGG and Literature | (Ho |
‘+’ indicates that the centrality method ranks the gene within its top 20, and ‘−’ indicates that the gene is not among the top 20 genes inferred by that method. Evidence for each gene-disease relationship was confirmed using the PGDB database, the KEGG pathway for prostate cancer and articles indexed in PubMed.
Gene names normalized by HUGO and their descriptions
| Gene | Description |
|---|---|
| TP53 | Tumor protein p53 (Li-Fraumeni syndrome) |
| BRCA1 | Breast cancer 1, early onset |
| EREG | Epiregulin |
| AKT1 | V-akt murine thymoma viral oncogene homolog 1 |
| MAPK1 | Mitogen-activated protein kinase 1 |
| TNF | Tumor necrosis factor (TNF superfamily, member 2) |
| CCND1 | Cyclin D1 |
| MYC | V-myc myelocytomatosis viral oncogene homolog (avian) |
| APC | Adenomatosis polyposis coli |
| CDKN1B | Cyclin-dependent kinase inhibitor 1B (p27, Kip1) |
| MAPK8 | Mitogen-activated protein kinase 8 |
| NR3C1 | Nuclear receptor subfamily 3, group C, member 1 (glucocorticoid receptor) |
| VEGFA | Vascular endothelial growth factor A |
| MDM2 | Mouse double minute 2, human homolog of; p53-binding protein |
| POLD1 | Polymerase (DNA directed), delta 1, catalytic subunit 125kDa |
| SNORA62 | Small nucleolar RNA, H/ACA box 62 |
| CNTN2 | Contactin 2 (axonal) |
| PPA1 | Pyrophosphatase (inorganic) 1 |
| TMEM37 | Transmembrane protein 37 |
| FZR1 | Fizzy/cell division cycle 20 related 1 (Drosophila) |
| SSSCA1 | Sjogren's syndrome/scleroderma autoantigen 1 |
| BCL2 | B-cell CLL/lymphoma 2 |
| INS | Insulin |
Definitions used in the evaluation of the top 20 genes
| Term | Definition |
|---|---|
| Seed gene | A gene that is one of the prostate cancer genes retrieved from OMIM Morbid Map (i.e. one of the genes in the seed gene table) |
| Inferred gene | A non-seed gene |
| Percentage of inferred genes | (Number of inferred genes / 20) × 100 |
| Confirmed inferred gene | An inferred gene found to be related to prostate cancer based on PGDB, the KEGG pathway for prostate cancer and published articles |
| Percentage of confirmed inferred genes | (Number of confirmed inferred genes / Number of inferred genes) × 100 |
| Percentage of confirmed genes | ((Number of confirmed inferred genes + Number of seed genes) / 20) × 100 |
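The definitions above reduce to a few lines of arithmetic. The sketch below is an illustration with an invented function name. It assumes that every gene in the top 20 is either a seed gene or an inferred gene, which follows from defining an inferred gene as a non-seed gene. The inputs (5 seed genes, 14 confirmed inferred genes) are the values reported for degree centrality in the summary of results.

```python
def summarize_top_genes(num_seed, num_confirmed_inferred, top_k=20):
    """Apply the evaluation definitions to one ranking's top-k list."""
    # Every top-k gene that is not a seed gene counts as inferred.
    num_inferred = top_k - num_seed
    if num_inferred == 0:
        pct_confirmed_inferred = 0.0  # no inferred genes to confirm
    else:
        pct_confirmed_inferred = 100.0 * num_confirmed_inferred / num_inferred
    return {
        "inferred": num_inferred,
        "pct_inferred": 100.0 * num_inferred / top_k,
        "pct_confirmed_inferred": pct_confirmed_inferred,
        "pct_confirmed": 100.0 * (num_confirmed_inferred + num_seed) / top_k,
    }


# Degree centrality: 5 seed genes and 14 confirmed inferred genes.
degree = summarize_top_genes(num_seed=5, num_confirmed_inferred=14)
print(degree)
```

For these inputs the sketch yields 15 inferred genes, 75% inferred, 93.33% confirmed inferred (after rounding to two decimals) and 95% confirmed, in agreement with the Degree column of the summary.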
Summary of the results for the top 20 genes
| Measure | Degree | Eigenvector | Betweenness | Closeness | Baseline |
|---|---|---|---|---|---|
| Number of seed genes | 5 | 6 | 7 | 2 | 3 |
| Number of inferred genes | 15 | 14 | 13 | 18 | 17 |
| Percentage of inferred genes | 75 | 70 | 65 | 90 | 85 |
| Number of confirmed inferred genes | 14 | 13 | 8 | 13 | 10 |
| Percentage of confirmed inferred genes | 93.33 | 92.86 | 61.54 | 72.22 | 58.82 |
| Percentage of confirmed genes | 95 | 95 | 75 | 75 | 65 |