Yingyi Hao1, Li He2, Yifan Zhou1, Yiru Zhao3, Menglong Li1, Runyu Jing4, Zhining Wen1,5.
Abstract
In clinical cancer research, accurately stratifying patients based on genomic data is an active area of study. With the development of next-generation sequencing technology, a growing number of genomic feature types, such as mRNA expression levels, can be used to distinguish cancer patients. Previous studies commonly stratified patients using a single type of genomic feature, which reflects only one aspect of the cancer. Multiscale genomic features provide more information and may be helpful for clinical prediction. In addition, most conventional machine learning algorithms construct models from a handcrafted gene set, which is generally selected by a statistical method with an arbitrary cut-off, e.g., p value < 0.05. The genes in such a set are not necessarily related to the cancer and may make the model unreliable. Therefore, in our study, we thoroughly investigated the performance of different machine learning methods in stratifying breast cancer patients with a single type of genomic feature. We then proposed a strategy that takes into account the degree of correlation between genes and cancer patients to identify features from mRNAs and microRNAs, and evaluated the performance of models built on the resulting combined multiscale genomic features. The results showed that, compared with models constructed with a single type of feature, the models with the multiscale genomic features generated by our proposed method achieved better performance in stratifying the ER status of breast cancer patients. Moreover, gene set enrichment analysis showed that the identified multiscale genomic features were closely related to the cancer, indicating that our proposed strategy reflects the biological relevance of the genes to breast cancer.
In conclusion, modelling with multiscale genomic features closely related to the cancer can not only guarantee the prediction performance of the models but also effectively provide candidate genes for interpreting the mechanisms of cancer.
Year: 2020 PMID: 32908867 PMCID: PMC7471833 DOI: 10.1155/2020/1475368
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1. The workflow of our study.
Figure 2. The MCCs achieved by different machine learning algorithms combined with different feature selection methods. (a) Prediction results with the top 300 mRNAs as features. (b) Prediction results with the top 300 microRNAs as features.
Figure 3. Mean MCCs achieved by the models across different machine learning algorithms and different feature selection methods. (a) Mean MCCs achieved by different machine learning algorithms with the top 300 mRNAs as features. (b) Mean MCCs achieved by different machine learning algorithms with the top 300 microRNAs as features. (c) Mean MCCs achieved by using different feature selection methods with the top 300 mRNAs as features. (d) Mean MCCs achieved by using different feature selection methods with the top 300 microRNAs as features.
Figure 4. MCCs for the independent test set by using mRNAs, microRNAs, and the combination of mRNAs and microRNAs as features. The Welch t-test p value for the MCCs achieved by the models with combined mRNAs and miRNAs before and after feature selection was 0.171.
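The figure captions report model quality as MCC (Matthews correlation coefficient) and compare two groups of MCCs with a Welch t-test. As a minimal sketch of those two quantities, and not the authors' actual code, the Python below computes the MCC from binary confusion-matrix counts and the Welch t statistic with its Welch-Satterthwaite degrees of freedom. The function names `mcc` and `welch_t` and the example values are assumptions made for illustration; none of the numbers come from the study.

```python
import math
from statistics import mean, variance


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when a row or column of the matrix is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Unlike Student's t-test, the two samples are not assumed to share
    a common variance. Each sample needs at least two values.
    """
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


if __name__ == "__main__":
    # Hypothetical confusion-matrix counts for an ER+/ER- classifier.
    print(mcc(tp=50, tn=40, fp=5, fn=5))
    # Hypothetical MCCs of several models before and after feature selection.
    before = [0.78, 0.80, 0.75, 0.82]
    after = [0.81, 0.83, 0.79, 0.84]
    print(welch_t(before, after))
```

Converting the statistic to a p value requires the Student t distribution with `df` degrees of freedom; in practice a statistics library would typically be used for that step.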
The KEGG pathways enriched with 289 mRNAs identified by SHAP.
| KEGG pathway | p value |
|---|---|
| hsa05120: epithelial cell signaling in | 0.036 |
| hsa05206: microRNAs in cancer | 0.036 |
| hsa01100: metabolic pathways | 0.034 |
| hsa05200: pathways in cancer | 0.026 |
The top ten GO terms related to biological processes significantly enriched with 289 mRNAs identified by SHAP.
| GO terms | p value |
|---|---|
| GO:0042493~response to drug | |
| GO:0035239~tube morphogenesis | |
| GO:0051093~negative regulation of developmental process | |
| GO:0042592~homeostatic process | |
| GO:0051270~regulation of cellular component movement | |
| GO:0048565~digestive tract development | |
| GO:0048871~multicellular organismal homeostasis | |
| GO:0010817~regulation of hormone levels | |
| GO:0035148~tube formation | 0.001 |
| GO:0045595~regulation of cell differentiation | 0.002 |
The top ten GO terms related to molecular functions significantly enriched with 289 mRNAs identified by SHAP.
| GO terms | p value |
|---|---|
| GO:0004716~receptor signaling protein tyrosine kinase activity | |
| GO:0042802~identical protein binding | 0.001 |
| GO:0008134~transcription factor binding | 0.002 |
| GO:0045502~dynein binding | 0.006 |
| GO:0046983~protein dimerization activity | 0.011 |
| GO:0019899~enzyme binding | 0.017 |
| GO:0016769~transferase activity, transferring nitrogenous groups | 0.030 |
| GO:0016772~transferase activity, transferring phosphorus-containing groups | 0.035 |
| GO:0044325~ion channel binding | 0.039 |
| GO:0019904~protein domain-specific binding | 0.046 |