Shanfeng Zhu, Yasushi Okuno, Gozoh Tsujimoto, Hiroshi Mamitsuka.
Abstract
An important issue in current medical research is identifying the genes that are strongly related to an inherited disease. Particular attention is paid to cancer-gene relations, since some types of cancer are inherited. Because biomedical databases have grown rapidly in recent years, an informatics approach is needed to predict such relations from the databases currently available. Our objective is to find implicitly associated cancer-gene pairs from biomedical databases, including the literature database. Co-occurrence of biological entities has been shown to be a popular and efficient technique in biomedical text mining. We applied a new probabilistic model, the mixture aspect model (MAM) [48], to combine different types of co-occurrences of genes and cancers derived from Medline and OMIM (Online Mendelian Inheritance in Man). We trained the probability parameters of MAM with a learning method based on the EM (Expectation-Maximization) algorithm, and examined the performance of MAM by predicting associated cancer-gene pairs. In cross-validation, prediction accuracy improved when gene-gene co-occurrences from Medline were added to the cancer-gene co-occurrences in OMIM. Further experiments showed that MAM found new cancer-gene relations that are unknown in the literature. Supplementary information can be found at http://www.bic.kyotou.ac.jp/pathway/zhusf/CancerInformatics/Supplemental2006.html.
Keywords: Cancer gene discovery; Cancer genetics; Machine learning; Probabilistic model; Text mining
Year: 2007 PMID: 19458778 PMCID: PMC2675505
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
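The abstract describes training an aspect-style model on co-occurrence data with EM. As a minimal sketch only, the block below fits a plain single-dataset aspect model, p(x, y) = Σ_z p(z) p(x|z) p(y|z), to a dictionary of co-occurrence counts. It is not the authors' mixture aspect model: how MAM combines several co-occurrence types is not specified in this record and is not reproduced here. The function names, the number of aspects, and the toy counts (`cancerA`, `gene1`, and so on) are all assumptions made for illustration.

```python
import math
import random


def fit_aspect_model(counts, n_aspects=2, n_iter=50, seed=0):
    """Fit p(x, y) = sum_z p(z) p(x|z) p(y|z) by EM.

    counts maps (x, y) pairs to co-occurrence counts; the total count
    is assumed to be positive.
    """
    rng = random.Random(seed)
    xs = sorted({x for x, _ in counts})
    ys = sorted({y for _, y in counts})

    def random_dist(keys):
        # Random strictly positive starting distribution over keys.
        w = {k: rng.random() + 0.1 for k in keys}
        total = sum(w.values())
        return {k: v / total for k, v in w.items()}

    pz = [1.0 / n_aspects] * n_aspects
    px_z = [random_dist(xs) for _ in range(n_aspects)]
    py_z = [random_dist(ys) for _ in range(n_aspects)]

    for _ in range(n_iter):
        new_pz = [0.0] * n_aspects
        new_px = [dict.fromkeys(xs, 0.0) for _ in range(n_aspects)]
        new_py = [dict.fromkeys(ys, 0.0) for _ in range(n_aspects)]
        for (x, y), n in counts.items():
            # E-step: posterior over aspects for this observed pair.
            joint = [pz[z] * px_z[z][x] * py_z[z][y] for z in range(n_aspects)]
            total = sum(joint)
            if total == 0.0:
                continue
            for z in range(n_aspects):
                r = n * joint[z] / total
                new_pz[z] += r
                new_px[z][x] += r
                new_py[z][y] += r
        # M-step: renormalise the expected counts.
        grand = sum(new_pz)
        for z in range(n_aspects):
            if new_pz[z] > 0.0:
                px_z[z] = {x: v / new_pz[z] for x, v in new_px[z].items()}
                py_z[z] = {y: v / new_pz[z] for y, v in new_py[z].items()}
            pz[z] = new_pz[z] / grand
    return pz, px_z, py_z


def log_likelihood(pair, model):
    """Natural-log probability of one (x, y) pair under a fitted model."""
    pz, px_z, py_z = model
    x, y = pair
    p = sum(pz[z] * px_z[z].get(x, 0.0) * py_z[z].get(y, 0.0)
            for z in range(len(pz)))
    return math.log(p) if p > 0.0 else float("-inf")


# Hypothetical toy counts, not taken from Medline or OMIM.
toy_counts = {
    ("cancerA", "gene1"): 5, ("cancerA", "gene2"): 3,
    ("cancerB", "gene1"): 4,
    ("cancerC", "gene3"): 6, ("cancerC", "gene2"): 2,
}
toy_model = fit_aspect_model(toy_counts)
# Score a pair that never co-occurs in the toy data.
print(log_likelihood(("cancerB", "gene2"), toy_model))
```

Because probability mass is shared through the latent aspects, a pair that never co-occurs in the input can still receive a non-zero probability, which is what makes this family of models usable for proposing unseen pairs.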
The size of co-occurrence datasets.
| Item | Size |
|---|---|
| gene type | 2,017 |
| gene-gene | 3,118 |
| cancer type | 21 |
| cancer-cancer | 206 |
| cancer-gene | 3,743 |
The ratio of positive pairs in the gene-gene co-occurrence dataset.
| # co-occurrences | - (random) | ≥ 1 | > 1 | > 2 | > 3 | > 4 | > 5 | > 6 |
|---|---|---|---|---|---|---|---|---|
| Dataset size | 3,118 | 3,118 | 758 | 379 | 276 | 152 | 122 | 99 |
| Positive ratio (%) | 26.65 | 57.86 | 64.64 | 68.34 | 69.91 | 70.2 | 72.13 | 76.77 |
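The thresholding behind a table like the one above can be sketched as follows. This is an illustration under assumptions: `positive_ratio` is a hypothetical helper, each record is a `(count, is_positive)` tuple whose label is supplied by the caller (this record does not define what makes a gene-gene pair positive), and the toy records are invented.

```python
def positive_ratio(pairs, min_exclusive):
    """Return (subset size, % positive) for pairs with count > min_exclusive.

    pairs is a list of (co_occurrence_count, is_positive) tuples.
    The percentage is None when no pair passes the threshold.
    """
    kept = [is_pos for count, is_pos in pairs if count > min_exclusive]
    if not kept:
        return 0, None
    return len(kept), 100.0 * sum(kept) / len(kept)


# Hypothetical toy records, not the paper's data.
toy_pairs = [(1, True), (1, False), (2, True), (3, True), (5, False)]
for threshold in (0, 1, 2):
    print(threshold, positive_ratio(toy_pairs, threshold))
```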
AUCs and t-values (in parentheses) obtained by 50 rounds of cross-validation on cancer-gene pairs.
| Model | 3:1 | 1:1 | 1:3 |
|---|---|---|---|
| 3MAM | | | |
| 2MAM | 75.8 | | |
| 2MAM | 73.9 | 68.3 | |
| AM | 70.5 | 64.9 | |
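For readers unfamiliar with the two statistics reported above, here is a small sketch of how an AUC and a t-value over repeated cross-validation rounds can be computed. The AUC is the standard pairwise-ranking definition. The t-value is written as a paired t-statistic over per-round AUCs of two models, which is an assumption: this record does not say which t-test the authors used. Both function names are hypothetical.

```python
import math


def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Both lists must be non-empty.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def paired_t_value(aucs_a, aucs_b):
    """Paired t-statistic for per-round AUCs of two models.

    Assumes equal-length lists with at least two rounds and
    non-constant differences (otherwise the variance is zero).
    """
    diffs = [a - b for a, b in zip(aucs_a, aucs_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)
```

In a setting like the one in the table, `auc` would be applied once per cross-validation round to the log-likelihoods of held-out positive and negative pairs, and `paired_t_value` to the resulting lists of per-round AUCs.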
Figure 1: Cumulative number of positive examples with higher log-likelihoods.
The 20 cancer-gene pairs with the highest log-likelihoods that are not in our training dataset.
| Cancer Type | Gene Name | Log-likelihood |
|---|---|---|
| OVARY | TP53 | −3.078 |
| COLORECTAL | BCL2 | −3.085 |
| STOMACH | TP53 | −3.113 |
| LEUKEMIA | CDKN1A | −3.176 |
| LYMPHOMA | BAX | −3.191 |
| PANCREAS | TP53 | −3.199 |
| BREAST | NFKB1 | −3.222 |
| THYROID | TP53 | −3.234 |
| LYMPHOMA | TNF | −3.235 |
| LUNG | BCL2 | −3.244 |
| BREAST | BCL2 | −3.266 |
| KIDNEY | TP53 | −3.269 |
| BREAST | TNF | −3.293 |
| LEUKEMIA | TNF | −3.300 |
| COLORECTAL | TNF | −3.312 |
| LYMPHOMA | NFKB1 | −3.316 |
| LUNG | TNF | −3.323 |
| COLORECTAL | CASP8 | −3.330 |
| LEUKEMIA | NFKB1 | −3.336 |
| BRAIN | BCL2 | −3.340 |
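A ranking like the one in the table above amounts to scoring candidate pairs, dropping those already in the training data, and sorting by log-likelihood. The sketch below shows that step only; `top_unseen_pairs` is a hypothetical helper, and the `("CANCER_X", "GENE_X")` entry with its score is invented to stand in for a training pair. The other two scores are copied from the table.

```python
def top_unseen_pairs(scores, training_pairs, k=20):
    """Return the k highest-scoring (pair, log_likelihood) items
    whose pair does not appear in training_pairs."""
    unseen = [(pair, ll) for pair, ll in scores.items()
              if pair not in training_pairs]
    unseen.sort(key=lambda item: item[1], reverse=True)
    return unseen[:k]


example_scores = {
    ("OVARY", "TP53"): -3.078,       # from the table above
    ("COLORECTAL", "BCL2"): -3.085,  # from the table above
    ("CANCER_X", "GENE_X"): -2.500,  # hypothetical training pair
}
example_training = {("CANCER_X", "GENE_X")}
for pair, ll in top_unseen_pairs(example_scores, example_training, k=2):
    print(pair, ll)
```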