Sunkyu Kim, Heewon Lee, Keonwoo Kim, Jaewoo Kang.
Abstract
BACKGROUND: Embedding techniques for converting high-dimensional sparse data into low-dimensional distributed representations have been gaining popularity in various fields of research. In deep learning models, embedding is commonly used and has proven more effective than naive binary representation. However, no attempt has yet been made to embed highly sparse mutation profiles into dense distributed representations. Because binary representation does not capture biological context, its use is limited in many applications, such as discovering novel driver mutations. Additionally, training distributed representations of mutations is challenging because the amount of available biological data is small relative to the large text corpora used in text mining.
Keywords: Cancer; Deep learning; Distributed representation; Mut2Vec; Mutation embedding
Year: 2018 PMID: 29697361 PMCID: PMC5918431 DOI: 10.1186/s12920-018-0349-7
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
Fig. 1 Overview of the Mut2Vec pipeline. The pipeline is composed of two modules: an embedding module based on Skip-Gram and a vector post-processing module that applies retrofitting. With this pipeline, we build four mutation embedding models. The first model uses only the Skip-Gram module on mutation profiles; we call it basic Mut2Vec. In the Mut2Vec+PI model, the weight matrix of the Skip-Gram model is initialized with PubMed word vectors. In the Mut2Vec+R model, the output vectors of the basic Mut2Vec model are post-processed in the retrofitting module. In the Mut2Vec+PI+R model, both the initialization with PubMed word vectors and the post-processing are applied.
Fig. 2 Overview of the Skip-Gram model. The Skip-Gram model consists of an input layer, an embedding lookup (hidden) layer, and a prediction layer. The output of the embedding lookup layer is the distributed representation of the target word.
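
The three layers named in the Fig. 2 caption can be illustrated with a minimal, untrained sketch. It is not the authors' implementation: the class `TinySkipGram`, the helper `make_skipgram_pairs`, the window size, and the idea of treating the items of one profile as a token sequence are all assumptions made for illustration.

```python
import math
import random


def make_skipgram_pairs(tokens, window):
    """Return (target, context) pairs for contexts within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TinySkipGram:
    """Hypothetical, untrained Skip-Gram forward pass (no learning step)."""

    def __init__(self, vocab, dim, seed=0):
        rng = random.Random(seed)
        self.index = {tok: i for i, tok in enumerate(vocab)}
        # Input-side weights: one row per token; these rows are the embeddings.
        self.w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
        # Output-side weights used by the prediction layer.
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

    def embed(self, token):
        # Embedding lookup (hidden) layer: picking a row of w_in is
        # equivalent to multiplying a one-hot input vector by w_in.
        return self.w_in[self.index[token]]

    def predict(self, token):
        # Prediction layer: softmax over dot products with every output row.
        h = self.embed(token)
        scores = [sum(a * b for a, b in zip(h, row)) for row in self.w_out]
        return softmax(scores)


vocab = ["A", "B", "C"]  # placeholder tokens, not real data
model = TinySkipGram(vocab, dim=4)
pairs = make_skipgram_pairs(vocab, window=1)
probs = model.predict("B")
```

Training would adjust `w_in` and `w_out` so that `predict(target)` assigns high probability to observed context tokens; after training, `embed` returns the distributed representation described in the caption.
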
IntOGen data description for three cancer types
| Type | Known drivers | Predicted drivers | Predicted passengers |
|---|---|---|---|
| BRCA | 22 | 473 | 13702 |
| CM | 29 | 607 | 16863 |
| LUAD | 23 | 505 | 13929 |
Fig. 3 Driver/passenger mutation visualization. Visualization with Principal Component Analysis shows a clear difference between the driver and passenger mutation classes when PubMed information is applied. Red dots represent known driver mutations and blue dots represent sampled predicted passenger mutations. Normalized Mutual Information (NMI) is also calculated from the results of k-means clustering.
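
For readers unfamiliar with NMI, the following sketch computes it between a list of class labels (for example driver/passenger) and a list of cluster assignments. The caption does not say which normalization was used; the arithmetic-mean form below is an assumption, and the function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two equally long label lists."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (nij / n) * math.log(nij * n / (count_a[x] * count_b[y]))
        for (x, y), nij in joint.items()
    )


def nmi(a, b):
    """NMI with arithmetic-mean normalization (an assumed convention)."""
    ha, hb = entropy(a), entropy(b)
    if ha == 0 and hb == 0:
        return 1.0  # both labelings are a single group
    return 2 * mutual_information(a, b) / (ha + hb)


# Toy labels only: a clustering that matches the classes up to renaming
# scores about 1, and one that is independent of the classes scores about 0.
matching = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```
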
Fig. 4 Driver/passenger mutation visualization of binary vectors. Visualization with Principal Component Analysis of binary mutation vectors. The binary vectors are the column vectors of the patient-mutation profiles. Because driver mutations appear frequently in patients' cancer profiles, the driver mutation vectors contain many 1s and have large norms. As a result, some driver mutation vectors lie far from most other mutation vectors. The "Boxed Area" panel is an expanded view of the boxed area in the "Binary" panel, where most passenger mutation vectors lie. We found that drivers and passengers are actually intermixed there, even though they appear well separated at the broader scale.
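
The argument in the Fig. 4 caption, that a frequently mutated gene has a binary vector with many 1s and therefore a large norm, can be checked on toy data. The profiles and gene names below are placeholders invented for illustration, and `mutation_vector` and `l2_norm` are hypothetical helper names.

```python
import math


def mutation_vector(profiles, gene):
    """Column of the patient-by-mutation matrix: 1 if the patient carries `gene`."""
    return [1 if gene in profile else 0 for profile in profiles]


def l2_norm(vec):
    """Euclidean length; for a 0/1 vector this is sqrt(number of 1s)."""
    return math.sqrt(sum(v * v for v in vec))


# Four made-up patients; GENE_A is mutated in all of them, GENE_B in one.
profiles = [
    {"GENE_A", "GENE_C"},
    {"GENE_A"},
    {"GENE_A", "GENE_B"},
    {"GENE_A"},
]

frequent = mutation_vector(profiles, "GENE_A")  # [1, 1, 1, 1]
rare = mutation_vector(profiles, "GENE_B")      # [0, 0, 1, 0]

norm_frequent = l2_norm(frequent)  # 2.0
norm_rare = l2_norm(rare)          # 1.0
```

Because the norm grows with mutation frequency, a projection such as PCA can separate frequent from rare mutations on frequency alone, which is consistent with the caption's observation that the apparent separation disappears inside the boxed area.
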
The most enriched cluster characterized using mutation vectors from Mut2Vec+PI+R
| Labels | In cluster | In population | p-value |
|---|---|---|---|
| Known | 21 | 67 | 3.74e-37 |
| Known+predicted | 45 | 661 | 2.04e-51 |
| Candidates | 18 | 17923 | |
| All | 63 | 18584 | |
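
One common way to score how strongly a cluster is enriched for a label is a one-sided hypergeometric test on counts like those in the table above. Whether the authors used this exact test is not stated here, so the sketch below is an assumption; `hypergeom_upper_tail` is a hypothetical helper written with the standard library only.

```python
from math import comb


def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) when `draws` items are sampled without replacement
    from `population` items, of which `successes` carry the label."""
    upper = min(successes, draws)
    numerator = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return numerator / comb(population, draws)


# Counts from the "Known" row: 21 of the 63 genes in the cluster are known
# drivers, against 67 known drivers among 18584 genes overall.
p_known = hypergeom_upper_tail(population=18584, successes=67, draws=63, observed=21)
```

Exact integer arithmetic via `math.comb` avoids the underflow that a naive floating-point product would hit at probabilities this small.
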
Pathway enrichment analysis of the most enriched cluster characterized using mutation vectors from Mut2Vec+PI+R
| KEGG PATHWAY | Adjusted p-value | Overlap |
|---|---|---|
| Pathways in cancer | 9.98e-24 | 25/397 |
| Central carbon metabolism in cancer | 2.51e-22 | 15/67 |
| Transcriptional misregulation in cancer | 6.76e-19 | 17/180 |
| MicroRNAs in cancer | 6.27e-14 | 16/297 |
| Prostate cancer | 1.93e-13 | 11/89 |
Genes in the most enriched cluster characterized using mutation vectors from Mut2Vec+PI+R
| Known | Predicted | Candidate |
|---|---|---|
| ABL1 | ASXL1 | BCL2 |
| ALK | BAP1 | CISH |
| DNMT3A | BCL6 | CRLF2 |
| EGFR | CALR | DUSP22 |
| ERBB2 | CCND1 | ERG |
| FGFR2 | CEBPA | EWSR1 |
| FGFR3 | ETV6 | FHIT |
| FLT3 | FGFR1 | MAML2 |
| GNAQ | H3F3A | MYBL1 |
| HRAS | IKZF1 | MYCL |
| IDH2 | MET | PDGFRB |
| JAK2 | MYCN | PLAG1 |
| KIT | NF2 | PRKACA |
| MYC | NOTCH1 | SPINK1 |
| MYD88 | NTRK1 | SS18 |
| NPM1 | PAX5 | TERT |
| NRAS | PDGFRA | TFE3 |
| RB1 | PPM1D | TP63 |
| RUNX1 | RET | |
| SDHB | RHOA | |
| SMO | ROS1 | |
| | SMARCB1 | |
| | TET2 | |
| | WT1 | |
Literature search results on driver candidates
| Gene | Tissue | References |
|---|---|---|
| BCL2 | Leukemia | [ |
| ERG | Prostate | [ |
| FHIT | Pancreas | [ |
| MAML2 | Bronchial Glands | |
| | Salivary Glands | [ |
| MYBL1 | Gland | [ |
| MYCL | Lung | [ |
| | Nerve Tissue | [ |
| PDGFRB | Myofibroma | [ |
| PRKACA | Cortisol | [ |
| | Liver | [ |
| SS18 | Diarthrosis | [ |
| TERT | Liver | [ |
| | Melanocyte | [ |
| | Thyroid | [ |
| | Unknown | [ |
| TP63 | Squamous Cell | [ |
Statistics of the most enriched cluster characterized using mutation vectors from Mut2Vec+PI(until2015)+R
| Labels | In cluster | In population | p-value |
|---|---|---|---|
| Known | 8 | 67 | 1.21e-8 |
| Known+predicted | 40 | 661 | 3.99e-28 |
| Candidates | 81 | 17923 | |
| All | 121 | 18584 | |
Literature search results on driver candidates characterized using mutation vectors from Mut2Vec+PI(until2015)+R
| Gene | Date | Tissue | References |
|---|---|---|---|
| RAD51 | 2016 | Lung, Kidney | [ |
| RAD51AP1 | 2017 | Melanoma | [ |
Top 10 enriched clusters in KEGG PATHWAY
| KEGG PATHWAY | Adjusted p-value | Overlap | Cluster size |
|---|---|---|---|
| Neuroactive ligand-receptor interaction | 2.22e-63 | 47/277 | 85 |
| Chemical carcinogenesis | 9.37e-52 | 26/82 | 38 |
| Cytokine-cytokine receptor interaction | 2.48e-46 | 36/265 | 72 |
| Metabolic pathways | 1.80e-45 | 58/1239 | 89 |
| Ribosome | 4.50e-40 | 27/265 | 35 |
| Cell cycle | 5.55e-37 | 24/137 | 37 |
| Endocytosis | 6.75e-34 | 25/277 | 32 |
| Complement and coagulation cascades | 8.25e-33 | 21/124 | 35 |
| Fanconi anemia pathway | 6.41e-28 | 36/259 | 146 |
| Systemic lupus erythematosus | 4.26e-27 | 19/265 | 20 |