
Feature Selection for Breast Cancer Classification by Integrating Somatic Mutation and Gene Expression.

Qin Jiang, Min Jin.

Abstract

Exploring the molecular mechanisms of breast cancer is essential for the early prediction, diagnosis, and treatment of cancer patients. The large volume of data produced by high-throughput sequencing makes it difficult to identify driver mutations and a minimal optimal set of genes that are critical to cancer classification. In this study, we propose a novel method that requires no prior information to identify mutated genes associated with breast cancer. The somatic mutation data are processed into a mutation matrix, from which the mutation frequency of each gene is obtained. By setting a reasonable threshold on the mutation frequency, a mutated gene set is filtered from the mutation matrix. The gene expression data are used to generate a gene expression matrix, onto which the mutated gene set is mapped to construct a co-expression profile. For feature selection, we propose a staged algorithm that uses fold change and false discovery rate to select differentially expressed genes, mutual information to remove irrelevant and redundant features, and an embedded method based on a gradient boosting decision tree with Bayesian optimization to obtain an optimal model. For evaluation, we propose a weighted metric that modifies traditional accuracy to address the sample imbalance problem. We apply the proposed method to The Cancer Genome Atlas breast cancer data and identify a mutated gene set, among which the implicated genes are oncogenes or tumor suppressors previously reported to be associated with carcinogenesis. For comparison with the integrative network, we also run the optimal model on gene expression alone and on the gold standard PAM50. The results show that the integrative network outperforms gene expression alone and PAM50 on the average of most metrics, which indicates that integrating multiple data sources is effective and that the proposed method can discover mutated genes associated with breast cancer.
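Two steps in the abstract can be illustrated with a minimal sketch: filtering genes by mutation frequency, and scoring a classifier with a class-weighted accuracy. The abstract gives neither the threshold value nor the formula of the weighted metric, so the function names, the toy data, the 0.5 threshold, and the use of mean per-class recall (balanced accuracy) as a stand-in are all assumptions, not the authors' implementation.

```python
# Illustrative sketch only: names, toy data, and threshold are hypothetical.

def mutation_frequencies(mutation_matrix):
    """Fraction of samples in which each gene is mutated.

    mutation_matrix is a list of rows (samples); each row holds 0/1 flags
    per gene (1 = gene carries a somatic mutation in that sample).
    """
    n_samples = len(mutation_matrix)
    n_genes = len(mutation_matrix[0])
    return [
        sum(row[j] for row in mutation_matrix) / n_samples
        for j in range(n_genes)
    ]


def filter_mutated_genes(mutation_matrix, gene_names, min_frequency):
    """Keep genes whose mutation frequency reaches the chosen threshold."""
    freqs = mutation_frequencies(mutation_matrix)
    return [g for g, f in zip(gene_names, freqs) if f >= min_frequency]


def class_weighted_accuracy(y_true, y_pred):
    """Mean of per-class recall (balanced accuracy).

    Used here as an assumed stand-in for the paper's weighted metric:
    every class contributes equally regardless of its sample count.
    """
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Toy mutation matrix: 4 samples x 3 placeholder genes.
genes = ["GENE_A", "GENE_B", "GENE_C"]
mutations = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
]
selected = filter_mutated_genes(mutations, genes, min_frequency=0.5)

# Imbalanced toy labels: a classifier that always predicts class 1.
y_true = [1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 1]
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
weighted = class_weighted_accuracy(y_true, y_pred)
```

In this toy case the always-positive classifier reaches a plain accuracy of 0.8 but a class-weighted accuracy of only 0.5, which is the kind of distortion a weighted metric is meant to expose on imbalanced tumor/normal samples.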
Copyright © 2021 Jiang and Jin.


Keywords:  breast cancer; classification; feature selection; gradient boosted decision tree; machine learning

Year:  2021        PMID: 33719339      PMCID: PMC7952975          DOI: 10.3389/fgene.2021.629946

Source DB:  PubMed          Journal:  Front Genet        ISSN: 1664-8021            Impact factor:   4.599


