Amira Al-Aamri, Kamal Taha, Yousof Al-Hammadi, Maher Maalouf, Dirar Homouz.
Abstract
BACKGROUND: Understanding genetic networks and their role in chronic diseases (e.g., cancer) is an important objective of biological research. In this work, we present a text mining system that constructs a gene-gene interaction network for the entire human genome and then performs network analysis to identify disease-related genes. We recognize interacting genes based on their co-occurrence frequency within the biomedical literature, employing linear and non-linear rare-event classification models. We analyze the constructed gene network using different network centrality measures to assess the importance of each gene. Specifically, we apply betweenness, closeness, eigenvector, and degree centrality metrics to rank the central genes of the network and to identify possible cancer-related genes.
Keywords: Biological NLP; Biomedical literature; Disease-gene association; Genetic network; Text mining
Year: 2019 PMID: 30736752 PMCID: PMC6368766 DOI: 10.1186/s12859-019-2634-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Number of new cases and deaths for each common cancer type, from NIH [2]
Description of features for the pair (g1, g2)
| Feature | Biological terms | Text level |
|---|---|---|
| | | Abstract |
| | | Sentence |
| | | Semantic |
| | | Abstract |
| | | Sentence |
| | | Semantic |
| | | Abstract |
| | | Sentence |
| | | Semantic |
Each feature measures the number of times the two biological terms co-occur, relative to their individual appearances, at the given text level
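As a rough illustration of such a feature, the sketch below counts, at a single text level, how often two terms co-occur relative to how often either one appears. The normalization used here (co-occurrences divided by the number of text units mentioning either term) is an assumption, since the exact formulas are not recoverable from the table; the function name `cooccurrence_ratio` and the toy sentences are hypothetical.

```python
def cooccurrence_ratio(units, term_a, term_b):
    """Share of text units mentioning either term that mention both.

    `units` is a list of token sets, one per text unit (for example an
    abstract or a sentence). The normalization is an assumed reading of
    the feature description, not the paper's exact formula.
    """
    both = sum(1 for u in units if term_a in u and term_b in u)
    either = sum(1 for u in units if term_a in u or term_b in u)
    return both / either if either else 0.0


# Hypothetical sentence-level units, already reduced to gene symbols.
sentences = [
    {"BRCA1", "BRCA2"},
    {"BRCA1", "TP53"},
    {"BRCA2"},
    {"EGFR", "KRAS"},
]
ratio = cooccurrence_ratio(sentences, "BRCA1", "BRCA2")
print(f"sentence-level co-occurrence ratio: {ratio:.3f}")
```

Running the same function over abstract-level units instead of sentences would give the corresponding abstract-level feature under the same assumption.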
The logit transformation and regularized log-likelihood for both classifiers (WLR and WKLR)
| Model | Logit transformation | Regularized log-likelihood |
|---|---|---|
| WLR | | |
| WKLR | | |
A detailed description of each equation is given in the “Rare-event classification” section
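The equations themselves did not survive in the table. As general background only, a class-weighted logistic regression with a ridge penalty is commonly written as a weighted sum of log-likelihood terms minus an L2 penalty on the coefficients; the sketch below evaluates that generic objective. It is not taken from the paper: the class weights `w1` and `w0`, the penalty strength `lam`, and the function name are all assumptions.

```python
import math


def weighted_ridge_loglik(X, y, beta, w1, w0, lam):
    """Generic class-weighted logistic log-likelihood with an L2 penalty.

    X: list of feature rows, y: 0/1 labels, beta: coefficient list.
    w1 / w0 weight the positive / negative class (assumed form).
    """
    total = 0.0
    for row, label in zip(X, y):
        z = sum(b * x for b, x in zip(beta, row))
        # log(sigmoid(z)), computed in a numerically stable way
        if z >= 0:
            log_p = -math.log1p(math.exp(-z))
        else:
            log_p = z - math.log1p(math.exp(z))
        # log(1 - sigmoid(z)) = log(sigmoid(z)) - z
        log_q = log_p - z
        total += w1 * log_p if label == 1 else w0 * log_q
    penalty = 0.5 * lam * sum(b * b for b in beta)
    return total - penalty


# Hypothetical toy data: one positive and one negative pair.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1, 0]
value = weighted_ridge_loglik(X, y, [0.0, 0.0], w1=2.0, w0=1.0, lam=0.0)
print(value)
```

Giving the rare class a larger weight makes its misclassification cost more in the objective. A kernel variant would work on kernel evaluations between samples rather than on the raw feature rows, which is not shown here.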
Fig. 2 ROC curve for training the data using WLR. TPR increases at low FPR
Accuracy measures from training on a data set of gene pairs using WLR
| | Accuracy | AUC |
|---|---|---|
| Class 0 (unrelated) | 68 | |
| Class 1 (related) | 68 | |
Accuracy measures from training on a data set of gene pairs using WKLR
| | Accuracy | AUC |
|---|---|---|
| Class 0 (unrelated) | 71 | |
| Class 1 (related) | 85 | |
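Both accuracy tables carry an AUC column. As a minimal sketch of how an area under the ROC curve can be computed from classifier scores, the function below uses the rank-based formulation: the fraction of (positive, negative) pairs in which the positive example scores higher, with ties counted as half. The labels and scores are hypothetical, and nothing here reproduces the paper's evaluation code.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: probability that a positive outscores a negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores for four gene pairs (1 = related, 0 = unrelated).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration but would normally be replaced by a sort-based computation on large data sets.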
Fig. 3 Precision-recall curve using WLR
Fig. 4 Precision-recall curve using WKLR
Fig. 5 The process of network analysis and disease-gene identification
The seed genes retrieved from OMIM
| Prostate | Breast | Lung |
|---|---|---|
| PCAP | RAD54L | FASLG |
| HPC5 | CASP8 | CASP8 |
| MAD1L1 | BARD1 | DLEC1 |
| HPC4 | PIK3CA | RASSF1 |
| HIP1 | HMMR | PIK3CA |
| MSR1 | NQO2 | IRF1 |
| KLF6 | ESR1 | PRKN |
| PTEN | RB1CC1 | EGFR |
| MXI1 | SLC22A1L | BRAF |
| CD82 | TSG101 | MAP3K8 |
| BRCA2 | ATM | ERCC6 |
| CDH1 | KRAS | SLC22A1L |
| ZFHX3 | BRCA2 | PPP2R1B |
| HPCQTL19 | XRCC3 | KRAS |
| HPC3 | AKT1 | ERBB2 |
| CHEK2 | RAD51A | CYP2A6 |
| HPC6 | PALB2 | |
| AR | CDH1 | |
| | TP53 | |
| | PHB | |
| | PPM1D | |
| | BRIP1 | |
| | CHEK2 | |
Properties of the cancer-related gene-interaction networks, as reported by Cytoscape
| | Diameter | Nodes | cc ∗ | Interactions |
|---|---|---|---|---|
| Prostate | | | | |
| WLR | 9 | 257 | 0.038 | 275 |
| WKLR | 6 | 1808 | 0.086 | 2479 |
| Breast | | | | |
| WLR | 8 | 504 | 0.103 | 693 |
| WKLR | 6 | 3126 | 0.161 | 5986 |
| Lung | | | | |
| WLR | 7 | 555 | 0.070 | 691 |
| WKLR | 6 | 2355 | 0.067 | 3959 |
∗ cc refers to clustering coefficient
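To make the two structural properties in this table concrete, the sketch below computes the diameter (longest shortest path) and the average local clustering coefficient of a small undirected graph. The graph and node names are hypothetical, and the conventions used (connected graph, nodes with fewer than two neighbors contributing 0) are assumptions; whether Cytoscape applies exactly the same conventions is not stated here.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from `source` to reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def diameter(adj):
    """Longest shortest path; assumes the graph is connected."""
    return max(max(bfs_distances(adj, n).values()) for n in adj)


def average_clustering(adj):
    """Mean local clustering coefficient; degree < 2 contributes 0."""
    total = 0.0
    for nbs in adj.values():
        k = len(nbs)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(nbs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


# Hypothetical toy network: a triangle A-B-C with a pendant node D on C.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
net_diameter = diameter(adj)
net_cc = average_clustering(adj)
print(net_diameter, round(net_cc, 3))
```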
Percentage of top n genes related to lung cancer based on MalaCards database
| Top | Closeness | Betweenness | Degree | Eigenvector |
|---|---|---|---|---|
| 10 | 80.00 | 80.00 | 90.00 | 99.00 |
| 15 | 73.30 | 80.00 | 86.70 | 93.30 |
| 20 | 70.00 | 70.00 | 90.00 | 90.00 |
| 30 | 60.00 | 70.00 | 83.33 | 76.67 |
| 50 | 48.00 | 56.00 | 72.00 | 72.00 |
| 75 | 40.00 | 48.00 | 54.67 | 58.67 |
| 100 | 36.00 | 50.00 | 50.00 | 52.00 |
| 125 | 31.20 | 43.20 | 43.20 | 47.19 |
| 225 | 20.44 | 28.44 | 28.44 | 29.77 |
| 300 | 17.33 | 22.33 | 22.33 | 24.33 |
| 450 | 17.11 | 17.11 | 17.11 | 17.33 |
| 500 | 15.60 | 15.60 | 15.60 | 16.20 |
| 555 | 15.31 | 15.31 | 15.31 | 15.31 |
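The percentages in this table are of the "top n" kind: take the n highest-ranked genes under one centrality measure and report the share that appears in a reference list. A minimal sketch of that calculation follows; the gene identifiers and the reference set are hypothetical placeholders, not MalaCards data.

```python
def precision_at_n(ranked_genes, reference, n):
    """Percentage of the top-n ranked genes found in the reference set."""
    top = ranked_genes[:n]
    if not top:
        return 0.0
    hits = sum(1 for gene in top if gene in reference)
    return 100.0 * hits / len(top)


# Hypothetical ranking (best first) and hypothetical reference set.
ranked = ["G1", "G2", "G3", "G4", "G5"]
reference = {"G1", "G3", "G9"}
p2 = precision_at_n(ranked, reference, 2)
p5 = precision_at_n(ranked, reference, 5)
print(p2, p5)
```

As n grows toward the size of the whole network, every centrality measure ranks the same set of genes, so the percentages in the columns necessarily converge to the same value.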
Precision of the top 15 genes under each centrality measure, evaluated against MalaCards
| | Closeness | Betweenness | Degree | Eigenvector |
|---|---|---|---|---|
| Prostate | | | | |
| WLR | 53.3 | | 80 | 66.7 |
| WKLR | 46.7 | 80 | | 66.7 |
| Breast | | | | |
| WLR | 80 | 86.7 | | |
| WKLR | 46.7 | | | 86.7 |
| Lung | | | | |
| WLR | 73.3 | 80 | 86.7 | *93.3* |
| WKLR | 60 | | | |
The highest precisions are shown in italics
Precision of the top 15 genes under each centrality measure, evaluated against GDC
| | Closeness | Betweenness | Degree | Eigenvector |
|---|---|---|---|---|
| Prostate | | | | |
| WLR | | 60 | 66.7 | |
| WKLR | 33.3 | | | |
| Breast | | | | |
| WLR | 73.3 | 40 | 53.3 | |
| WKLR | 46.7 | 66.7 | 66.7 | |
| Lung | | | | |
| WLR | 20 | 20 | 33.3 | |
| WKLR | 40 | 40 | 40 | |
The highest precisions are shown in italics
Precision of the top 15 genes under each centrality measure, evaluated against both GDC and MalaCards
| | Closeness | Betweenness | Degree | Eigenvector |
|---|---|---|---|---|
| Prostate | | | | |
| WLR | 93.3 | 93.3 | 93.3 | 86.7 |
| WKLR | 60 | 86.7 | 93.3 | 80 |
| Breast | | | | |
| WLR | 80 | 86.7 | 93.3 | 93.3 |
| WKLR | 53.3 | 100 | 100 | 86.7 |
| Lung | | | | |
| WLR | 73.3 | 80 | 86.7 | 100 |
| WKLR | 66.7 | 86.7 | 86.7 | 93.3 |
The recall of seed genes in the whole human genome network created by using either WLR or WKLR
| Prostate seeds | Recall | Breast seeds | Recall | Lung seeds | Recall |
|---|---|---|---|---|---|
| WLR | 66.6 | WLR | 100 | WLR | 100 |
| WKLR | 72.2 | WKLR | 100 | WKLR | 100 |
To the left, the top 30 genes predicted by our system and their relevance to breast cancer
| Proposed system | Relevant | Quan & Ren |
|---|---|---|
| BRCA2 | YES+Seed | TNF |
| ESR1 | YES+Seed | EGFR |
| CDH1 | YES+Seed | CRC |
| BRCA1 | YES | PTEN |
| PPM1D | YES+Seed | IL-6 |
| NQO2 | YES+Seed | AR |
| XRCC3 | YES+Seed | BRCA1 |
| TSG101 | YES+Seed | EGF |
| CDKN2A | candidate | GAPDH |
| PALB2 | YES+Seed | HR |
| BRIP1 | YES+Seed | AML |
| PIK3CA | YES+Seed | CD4 |
| MRE11A | candidate | STAT3 |
| RAD54L | YES+Seed | AD |
| ERBB2 | YES | MMP-9 |
| CHEK2 | YES+Seed | MS |
| RAD51C | candidate | RD |
| AKT1 | YES+Seed | MYC |
| TP53 | YES+Seed | S6 |
| RB1CC1 | YES+Seed | TP53 |
| RB1 | YES | ATM |
| HMMR | YES+Seed | IL-8 |
| STK11 | YES | AP1 |
| BARD1 | YES+Seed | MMP-2 |
| RAD51 | YES | GC |
| KRAS | YES+Seed | FBS |
| RAD50 | candidate | ES |
| ATM | YES+Seed | RA |
| BACH1 | Seed | CXCR4 |
| CASP8 | YES+Seed | BRCA2 |
To the right, a list of the Top 30 genes predicted by Quan & Ren
Fig. 6 The prediction is made over several thresholds. As the threshold increases, fewer pairs are assigned to the positive class
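The effect described in this caption can be sketched in a few lines: a pair is assigned to the positive class when its predicted probability reaches the threshold, so raising the threshold can only shrink the positive set. The probabilities and threshold values below are hypothetical, and the `>=` comparison is an assumed convention.

```python
def count_positive(probabilities, threshold):
    """Number of pairs whose predicted probability reaches the threshold."""
    return sum(1 for p in probabilities if p >= threshold)


# Hypothetical predicted probabilities for five gene pairs.
probs = [0.05, 0.20, 0.45, 0.60, 0.90]
counts = {t: count_positive(probs, t) for t in (0.1, 0.5, 0.8)}
print(counts)
```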
A comparison of the precision of the top 10 ranked genes, by centrality measure and by approach
| | Closeness | Betweenness | Degree |
|---|---|---|---|
| CGDA [ | 70 | 90 | 80 |
| EDC-EDC [ | 77.3 | 86.4 | 82.8 |
| MCforGN [ | 78 | 83 | 82 |
| | 80 | 80 | 80 |
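The rankings compared in this table come from scoring every gene with a centrality measure and keeping the highest-scoring ones. The sketch below shows that idea for degree and closeness centrality on a small undirected graph; betweenness and eigenvector centrality are omitted for brevity. The graph, node names, and normalizations (degree divided by n - 1, closeness as reachable nodes over summed distances) are assumptions for illustration, not the paper's implementation.

```python
from collections import deque


def degree_centrality(adj):
    """Number of neighbors divided by the maximum possible (n - 1)."""
    n = len(adj)
    return {v: len(nbs) / (n - 1) for v, nbs in adj.items()}


def closeness_centrality(adj):
    """Reachable nodes divided by the sum of shortest-path distances."""
    scores = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        total = sum(dist.values())
        scores[source] = (len(dist) - 1) / total if total else 0.0
    return scores


def top_n(scores, n):
    """Highest-scoring nodes first; ties broken alphabetically."""
    return sorted(scores, key=lambda v: (-scores[v], v))[:n]


# Hypothetical toy network: hub H linked to A, B, C, and C linked to D.
adj = {
    "H": {"A", "B", "C"},
    "A": {"H"},
    "B": {"H"},
    "C": {"H", "D"},
    "D": {"C"},
}
deg = degree_centrality(adj)
clo = closeness_centrality(adj)
print(top_n(deg, 2), top_n(clo, 2))
```

On this toy graph both measures agree on the top two nodes; on larger networks the measures can rank genes quite differently, which is why the table compares them separately.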