Yuji Zhang, Feichen Shen, Majid Rastegar Mojarad, Dingcheng Li, Sijia Liu, Cui Tao, Yue Yu, Hongfang Liu.
Abstract
Recent scientific advances have produced a tremendous amount of biomedical knowledge, providing novel insights into the relationships between molecular and cellular processes and diseases. Literature mining is a commonly used method for retrieving and extracting information from scientific publications to understand these associations. However, because the data are voluminous and the associations are complex and noisy, interpreting such association data for semantic knowledge discovery is challenging. In this study, we describe an integrative computational framework that aims to expedite the discovery of latent disease mechanisms by dissecting 146,245 disease-gene associations from over 25 million PubMed-indexed articles. We take advantage of Latent Dirichlet Allocation (LDA) modeling and network-based analysis for their respective capabilities of detecting latent associations and reducing noise in large-volume data. Our results demonstrate that (1) LDA-based modeling is able to group similar diseases into disease topics; (2) the disease-specific association networks follow the scale-free network property; (3) certain subnetwork patterns are enriched in the disease-specific association networks; and (4) genes are enriched in topic-specific biological processes. Our approach offers promising opportunities for latent disease-gene knowledge discovery in biomedical research.
Year: 2018 PMID: 29373609 PMCID: PMC5786305 DOI: 10.1371/journal.pone.0191568
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
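The LDA step named in the abstract can be illustrated with a minimal sketch. It assumes that each disease is treated as a "document" whose "words" are the gene IDs associated with it; that mapping, the collapsed Gibbs sampler, the hyperparameters `alpha` and `beta`, and the toy data are all illustrative assumptions, not the authors' actual pipeline or toolkit.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs: list of lists of integer word IDs (assumed here: gene IDs per disease).
    Returns (doc_topic, topic_word) probability tables.
    """
    rng = random.Random(seed)
    doc_topic_counts = [[0] * n_topics for _ in docs]
    topic_word_counts = [[0] * vocab_size for _ in range(n_topics)]
    topic_totals = [0] * n_topics
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic_counts[d][z] += 1
            topic_word_counts[z][w] += 1
            topic_totals[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic_counts[d][z] -= 1
                topic_word_counts[z][w] -= 1
                topic_totals[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic_counts[d][k] + alpha)
                    * (topic_word_counts[k][w] + beta)
                    / (topic_totals[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic_counts[d][z] += 1
                topic_word_counts[z][w] += 1
                topic_totals[z] += 1

    # Smoothed point estimates of the two distributions.
    doc_topic = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic_counts)
    ]
    topic_word = [
        [(c + beta) / (topic_totals[k] + vocab_size * beta) for c in topic_word_counts[k]]
        for k in range(n_topics)
    ]
    return doc_topic, topic_word


# Hypothetical toy input: 4 diseases, 6 genes (IDs 0-5).
toy_docs = [[0, 1, 2, 0], [1, 2, 0], [3, 4, 5, 3], [4, 5, 3]]
doc_topic, topic_word = lda_gibbs(toy_docs, n_topics=2, vocab_size=6)
```

Each row of `doc_topic` is a disease's distribution over topics and each row of `topic_word` is a topic's distribution over genes; a production analysis would more likely rely on an established LDA library than on a hand-written sampler.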
Fig 1. (A) Overview of the proposed approach. (B) Log-likelihood score across different numbers of topics. (C) Log-likelihood score across different iterations.
Fig 2. (A) Distribution of diseases and genes across the 160 optimal disease topics. (B) Heatmap of cosine similarity for the top 10 topics at the disease level. (C) Heatmap of cosine similarity for the top 10 topics at the gene level. (D) Overall distribution of 146 LDA topics across the 19 Human Disease Network categories in Goh et al.
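The cosine similarity behind the Fig 2 heatmaps can be sketched as follows. The sketch assumes each topic is represented as a probability vector over diseases (or genes); the function names and the toy vectors are hypothetical and do not come from the study.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def similarity_matrix(topics):
    """Pairwise cosine similarities, i.e. the values a heatmap would display."""
    return [[cosine_similarity(a, b) for b in topics] for a in topics]


# Hypothetical topic-by-disease probabilities (3 topics, 4 diseases).
toy_topics = [
    [0.7, 0.2, 0.1, 0.0],
    [0.6, 0.3, 0.1, 0.0],
    [0.0, 0.1, 0.2, 0.7],
]
matrix = similarity_matrix(toy_topics)
```

In this toy input the first two topics concentrate on the same diseases, so `matrix[0][1]` is close to 1, while `matrix[0][2]` is close to 0.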
LDA topics with the most diseases mapped to each Human Disease Network category.
| Disease Category in Goh et al | Mapped LDA Topic | # Mapped Diseases |
|---|---|---|
| Bone | 96 | 141 |
| Cancer | 107 | 209 |
| Cardiovascular | 159 | 184 |
| Connective tissue disorder | 54 | 238 |
| Dermatological | 1 | 116 |
| Endocrine | 154 | 135 |
| Gastrointestinal | 152 | 133 |
| Hematological | 109 | 296 |
| Immunological | 33 | 129 |
| Metabolic | 39 | 482 |
| Multiple | 27 | 123 |
| Muscular | 114 | 140 |
| Neurological | 34 | 333 |
| Nutritional | 2 | 56 |
| Ophthalmological | 52 | 242 |
| Psychiatric | 145 | 77 |
| Renal | 38 | 120 |
| Skeletal | 41 | 110 |
| Unclassified | 39 | 75 |
Top 10 LDA topics containing the most OMIM disease-gene associations.
| LDA Topic | Percentage of Disease-Gene Associations overlapped with OMIM |
|---|---|
| 123 | 32.3% |
| 149 | 29.2% |
| 23 | 27.3% |
| 76 | 27.2% |
| 30 | 26.7% |
| 112 | 23.8% |
| 117 | 23.7% |
| 13 | 22.7% |
| 94 | 22.6% |
| 135 | 22.4% |
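An overlap percentage of the kind reported in this table can be sketched as a set intersection. The sketch assumes the denominator is the number of disease-gene pairs in the topic, which the table does not state; the pair names are placeholders, not real associations.

```python
def overlap_percentage(topic_pairs, reference_pairs):
    """Percentage of a topic's (disease, gene) pairs also found in a reference set.

    Assumption: the denominator is the topic's own pair count.
    """
    topic_pairs = set(topic_pairs)
    if not topic_pairs:
        return 0.0
    shared = topic_pairs & set(reference_pairs)
    return 100.0 * len(shared) / len(topic_pairs)


# Placeholder pairs for illustration only.
toy_topic = [
    ("disease_a", "GENE1"),
    ("disease_a", "GENE2"),
    ("disease_b", "GENE1"),
    ("disease_b", "GENE3"),
]
toy_reference = [("disease_a", "GENE1"), ("disease_c", "GENE4")]
toy_overlap = overlap_percentage(toy_topic, toy_reference)  # 1 of 4 pairs shared
```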
Fig 3. (A) Top 10 topics and their corresponding top 5 diseases based on probabilities. (B) Top 10 topics and their corresponding top 5 genes based on probabilities. In both panels, blue, red, green, purple, and cyan represent the diseases/genes ranked first through fifth, respectively.
Fig 4. (A) Precision-recall curves for the top 10 topics annotated by three independent disease ontologies. (B) Area under the curve (AUC) scores for the top 10 topics using three independent disease ontologies.
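A precision-recall curve and its area can be computed from a ranked list as sketched below. The sketch assumes each candidate has a score and a binary label (1 if an ontology supports it); the trapezoidal rule and the choice to start the curve at recall 0 with the first point's precision are illustrative conventions, not necessarily those used for Fig 4.

```python
def precision_recall_points(scores, labels):
    """(recall, precision) after each item, ranked by descending score."""
    total_positive = sum(labels)
    if total_positive == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []
    true_positives = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        points.append((true_positives / total_positive, true_positives / rank))
    return points


def area_under_curve(points):
    """Trapezoidal area under a precision-recall curve."""
    area = 0.0
    prev_recall, prev_precision = 0.0, points[0][1]
    for recall, precision in points:
        area += (recall - prev_recall) * (precision + prev_precision) / 2.0
        prev_recall, prev_precision = recall, precision
    return area


# Hypothetical scores and labels.
perfect = area_under_curve(precision_recall_points([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0]))
mixed = area_under_curve(precision_recall_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

A ranking that places every supported item first yields an area of 1.0; interleaving unsupported items lowers it.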
Statistics of top ten disease topics.
| Topic ID | Hub Diseases (node degree) | Number of Nodes | Number of Associations | Network Diameter | Characteristic Path Length |
|---|---|---|---|---|---|
| | carcinoma, non-small-cell lung (219); squamous cell carcinoma (210); neoplasm metastasis (194) | 608 | 10,895 | 5 | 2.55 |
| | chronic b-cell leukemias (42); cancer of rectum (41); liver neoplasms (38) | 459 | 2,957 | 6 | 3.01 |
| | Asthma (86); lymphoma, large-cell, diffuse (85); chronic lymphocytic leukemia (77) | 398 | 2,971 | 6 | 2.82 |
| | endometrial carcinoma (58); epithelial ovarian cancer (52); malignant neoplasm of endometrium (50) | 330 | 2,293 | 6 | 2.83 |
| | rheumatoid arthritis (76); inflammatory bowel diseases (68); inflammatory disorder (63) | 378 | 2,259 | 6 | 2.94 |
| | salivary gland neoplasms (17); prostatic intraepithelial neoplasias (14); mucinous neoplasm (14) | 244 | 1,058 | 7 | 2.71 |
| | Lymphoma (97); lymphoma, large-cell, diffuse (85); chronic lymphocytic leukemia (77) | 377 | 1,883 | 8 | 3.02 |
| | celiac disease (29); Sarcoidosis (22); graves disease (21) | 265 | 892 | 7 | 3.03 |
| | malignant neoplasm of skin (16); dysplastic nevus (13); carcinoma in situ of uterine cervix (12) | 231 | 984 | 7 | 2.57 |
| | uterine cervical neoplasms (14); mouse pancreatic intraepithelial neoplasia-2 (12); endometrial adenocarcinoma (11) | 240 | 799 | 6 | 2.69 |
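The network statistics in this table (node degree, number of nodes and associations, diameter, characteristic path length) can be computed for an undirected, unweighted graph with breadth-first search. The sketch below assumes that representation and ignores unreachable node pairs; the edge list and node names are hypothetical.

```python
from collections import deque


def build_graph(edges):
    """Adjacency sets for an undirected graph given (node, node) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def bfs_distances(graph, source):
    """Shortest path length (in edges) from source to every reachable node."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


def network_statistics(edges):
    graph = build_graph(edges)
    if not graph:
        raise ValueError("edge list is empty")
    # Shortest path lengths over all ordered pairs of distinct, connected nodes.
    lengths = [
        d
        for source in graph
        for target, d in bfs_distances(graph, source).items()
        if target != source
    ]
    degrees = {node: len(neighbors) for node, neighbors in graph.items()}
    return {
        "nodes": len(graph),
        "associations": sum(degrees.values()) // 2,
        "hub": max(degrees, key=degrees.get),
        "diameter": max(lengths),  # longest shortest path
        "characteristic_path_length": sum(lengths) / len(lengths),  # mean shortest path
    }


# Hypothetical disease-gene edges: D* are diseases, G* are genes.
toy_edges = [("D1", "G1"), ("D1", "G2"), ("D1", "G3"), ("G3", "D2")]
stats = network_statistics(toy_edges)
```

For the toy graph, `D1` is the hub with degree 3, the diameter is 3 (for example, `G1` to `D2`), and the characteristic path length is 1.8.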
A list of enriched diseases and disorders associated with genes in the AD association network.
| Canonical Pathways | -log(p-value) | Ratio | Molecules |
|---|---|---|---|
| Huntington's Disease Signaling | 8.95 | 0.06 | BDNF,CREBBP,TBP,NGF,TGM2,HDAC6,GRM5,AKT1,HDAC3,ATP5B,KL,HTT,DLG4,DCTN1,SNCA |
| G-Protein Coupled Receptor Signaling | 6.5 | 0.05 | GRM5,HTR2C,FYN,AKT1,GRK2,KL,CREBBP,PRKAR1B,HTR1A,DRD3,DRD2,ADORA2A,HTR2A |
| Neuropathic Pain Signaling In Dorsal Horn Neurons | 5.37 | 0.08 | GRM5,NTRK2,GPR37,BDNF,KL,GRIN2D,PRKAR1B,ELK1 |
| Parkinson's Signaling | 5.21 | 0.25 | GPR37,PARK7,PARK2,SNCA |
| Mitochondrial Dysfunction | 4.74 | 0.05 | SOD2,ATP5B,PARK7,LRRK2,HTRA2,PARK2,SNCA,APP,PINK1 |
| Neurotrophin/TRK Signaling | 4.37 | 0.07 | AKT1,NTRK2,BDNF,KL,CREBBP,NGF |
| PEDF Signaling | 4.28 | 0.07 | SOD2,AKT1,BDNF,KL,NGF,ELK1 |
| Serotonin Receptor Signaling | 4.23 | 0.09 | HTR2C,GCH1,SLC6A4,HTR1A,HTR2A |
| Dopamine Receptor Signaling | 4.09 | 0.07 | GCH1,COMT,PRKAR1B,DRD3,DRD2,SLC6A3 |
p-value: Benjamini-Hochberg (B-H) multiple-testing-corrected p-value; Ratio: the number of molecules in a given pathway that meet the cut criteria, divided by the total number of molecules that make up that pathway.
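The quantities in this footnote can be sketched in a few lines. The sketch assumes the standard Benjamini-Hochberg step-up adjustment and a base-10 logarithm for the `-log(p-value)` column; neither detail is stated in the record, and the p-values below are hypothetical. As a consistency check on the Ratio definition, the Parkinson's Signaling row lists 4 molecules with a ratio of 0.25, which would correspond to a pathway of 16 molecules.

```python
import math


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def neg_log10(p_value):
    """-log10(p); base 10 is an assumption about the table's '-log' column."""
    return -math.log10(p_value)


def pathway_ratio(molecules_meeting_criteria, pathway_size):
    """Ratio as defined in the footnote: hits divided by pathway size."""
    return molecules_meeting_criteria / pathway_size


# Hypothetical raw p-values for four pathways.
toy_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
toy_ratio = pathway_ratio(4, 16)  # 16 is inferred from 4 / 0.25, not reported
```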