Guang Li (1), Meng Yang (2), Longke Ran (3), Fu Jin (4).
1. Department of Radiotherapy, Chongqing University Cancer Hospital, Chongqing, China.
2. Department of Equipment, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China.
3. Department of Bioinformatics, Chongqing Medical University, Chongqing, China. longkeran@aliyun.com.
4. Department of Radiotherapy, Chongqing University Cancer Hospital, Chongqing, China. jfazj@126.com.
Abstract
OBJECTIVE: To use weighted gene co-expression network analysis (WGCNA) and machine learning algorithms to predict the classification of early pulmonary nodules using public databases. METHODS: Expression and clinical data of lung cancer patients were first extracted from public databases (GTEx and TCGA) to identify differentially expressed genes (DEGs) in lung adenocarcinoma (LUAD). The intersection of the results from three R packages (DESeq2, limma, edgeR) was taken as the set of candidate DEGs for further study. WGCNA was used to obtain the modules and key genes relevant to lung cancer classification, and GO and KEGG enrichment analyses were performed. Models were built using two machine learning methods: Least Absolute Shrinkage and Selection Operator (LASSO) regression and the eXtreme Gradient Boosting (XGBoost) algorithm, which was also used to predict tumor classification. RESULTS: DEG analysis identified 1306 DEGs in LUAD. WGCNA module analysis showed that a total of 116 genes were significantly related to classification, and the module genes were mainly enriched in 14 KEGG pathways. LASSO regression analysis of the differential genes identified 10 target genes, and the XGBoost model identified 18 genes. The intersection of the above methods yielded 6 genes as classification signatures of early pulmonary nodules: HMGB3, ARHGAP6, TCF21, FCN3, COL6A6, and GOLM1. CONCLUSION: Using DEG analysis, WGCNA, and machine learning algorithms, six gene signatures related to early-stage LUAD were identified, which can assist clinicians in disease classification prediction.
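The gene-selection logic described in the abstract is a chain of set intersections, which can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline (which used R packages): the function name `intersect_gene_sets` and all gene identifiers (`G1` to `G9`) are hypothetical placeholders, and the abstract does not say exactly which lists entered the final intersection, so the sketch assumes the candidate DEGs, the WGCNA module genes, the LASSO-selected genes, and the XGBoost-selected genes.

```python
from functools import reduce


def intersect_gene_sets(*gene_sets):
    """Return the genes present in every input collection, sorted by name."""
    if not gene_sets:
        return []
    common = reduce(set.intersection, (set(genes) for genes in gene_sets))
    return sorted(common)


# Hypothetical placeholder results from the three DEG methods (not real data).
deseq2_hits = ["G1", "G2", "G3", "G4", "G5"]
limma_hits = ["G2", "G3", "G4", "G5", "G6"]
edger_hits = ["G1", "G2", "G3", "G4", "G7"]

# Step 1: candidate DEGs are the genes reported by all three methods.
candidate_degs = intersect_gene_sets(deseq2_hits, limma_hits, edger_hits)

# Hypothetical placeholder outputs of the downstream analyses (not real data).
wgcna_module_genes = ["G2", "G3", "G4", "G8"]
lasso_selected = ["G2", "G3", "G9"]
xgboost_selected = ["G3", "G2", "G4"]

# Step 2 (assumed): the signature is what survives every selection step.
signature = intersect_gene_sets(
    candidate_degs, wgcna_module_genes, lasso_selected, xgboost_selected
)

print("Candidate DEGs:", candidate_degs)
print("Signature genes:", signature)
```

Taking a strict intersection favors genes that are robust to the choice of method, at the cost of discarding genes that only one method detects.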