Dongmei Ai, Yuduo Wang, Xiaoxin Li, Hongfei Pan.
Abstract
An effective feature extraction method is key to improving the accuracy of a prediction model. From the Gene Expression Omnibus (GEO) database, we obtained microarray gene expression data covering 13,487 genes for 238 samples, comprising colorectal cancer (CRC) samples and normal samples. Twelve gene modules were obtained by weighted gene co-expression network analysis (WGCNA) on 173 samples. By calculating the Pearson correlation coefficient (PCC) between the characteristic gene (eigengene) of each module and colorectal cancer status, we identified a key module that was highly correlated with CRC. We screened hub genes from the key module by considering module membership, gene significance, and intramodular connectivity, and selected 10 hub genes as one type of feature for the classifier. We then applied a variational autoencoder (VAE) to 1159 significantly differentially expressed genes and mapped the data into a 10-dimensional representation, which served as a second type of feature for the cancer classifier. Both types of features were fed to a support vector machine (SVM) classifier for CRC, which achieved an accuracy of 0.9692 and an AUC of 0.9981. These results show that the two-step feature extraction method, which combines hub genes obtained by WGCNA with a 10-dimensional VAE representation, yields high classification accuracy.
Keywords: classifier; colorectal cancer; hub genes; variational autoencoder; weighted gene co-expression network analysis
Year: 2020 PMID: 32825264 PMCID: PMC7563725 DOI: 10.3390/biom10091207
Source DB: PubMed Journal: Biomolecules ISSN: 2218-273X
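The 10-dimensional VAE representation mentioned in the abstract can be illustrated with the two pieces that distinguish a VAE from a plain autoencoder: the reparameterized sampling step and the KL regularization term. The following is a minimal sketch, assuming the standard formulation with a diagonal Gaussian encoder and a standard normal prior; the abstract does not state the exact architecture or loss, and the function names are hypothetical.

```python
import math
import random

LATENT_DIM = 10  # latent size stated in the abstract


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one draw per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


# Illustrative encoder outputs for one sample (assumed values, not paper data).
rng = random.Random(0)
mu = [0.0] * LATENT_DIM
log_var = [0.0] * LATENT_DIM
z = reparameterize(mu, log_var, rng)  # 10-dimensional latent feature vector
```

In a pipeline like the one described, the encoder mean `mu` (or a sample `z`) for each sample would be used as the second feature type, next to the expression values of the 10 hub genes.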
Figure 1. PCA of four datasets before and after batch effect correction. (A) PCA of the four datasets before batch effect correction. Dataset GSE23878 differed markedly from the other three datasets; (B) PCA of the four datasets after batch effect correction. The batch effects among the four datasets were largely eliminated after the correction.
Figure 2. (A) Hierarchical clustering tree of all samples after classification. Part a is the clustering tree constructed from genes, part b shows the gene modules obtained by clustering, and part c shows the gene modules obtained after merging modules with similar expression patterns; (B) Heat map of the relationships between the eigengenes of different modules.
Gene module and number of corresponding genes.
| Color | Tan | Brown | Turquoise | Blue | Green | Purple |
|---|---|---|---|---|---|---|
| Number | 85 | 2799 | 6377 | 2636 | 400 | 153 |
| Color | | | | | | |
| Number | 330 | 175 | 60 | 318 | 118 | 36 |
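As a quick consistency check, the twelve module sizes listed above can be summed and compared with the 13,487 genes reported in the abstract. The short sketch below only uses the numbers from the table; the variable names are illustrative.

```python
# Module sizes copied from the table, in the order they are listed.
module_sizes = [85, 2799, 6377, 2636, 400, 153, 330, 175, 60, 318, 118, 36]

total_genes = sum(module_sizes)
largest_share = max(module_sizes) / total_genes

print(len(module_sizes), total_genes, round(largest_share, 3))
# prints: 12 13487 0.473
```

The sizes account for every gene in the dataset, and the largest module (turquoise, 6377 genes) holds a little under half of them.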
The PCC between each module and cancer after pruning.
| | MEtan | MEbrown | MEturquoise | MEblue | MEgreen | MEpurple |
|---|---|---|---|---|---|---|
| PCC | −0.1285 | −0.3052 | −0.9251 | −0.7075 | −0.2017 | 0.3457 |
| p-value | 0.0920 | 0.0000 | 0.0000 | 0.0000 | 0.0078 | 0.0000 |
| | | | | | | |
| PCC | −0.4263 | −0.5753 | 0.0609 | 0.5127 | 0.2944 | 0.3082 |
| p-value | 0.0000 | 0.0000 | 0.4260 | 0.0000 | 0.0000 | 0.0000 |
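The values listed beneath each row of coefficients appear to be two-sided p-values for the PCC. The sketch below shows how such a p-value follows from a correlation coefficient via the t statistic t = r·sqrt((n − 2)/(1 − r²)). It assumes the correlations were computed over the 173 samples used for WGCNA (the table does not state n), and the function names are hypothetical. The t-distribution tail is integrated numerically so that only the standard library is needed.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(x * x / df))


def pearson_p_value(r, n, steps=4000):
    """Two-sided p-value for a Pearson correlation r over n samples.

    `steps` must be even because Simpson's rule is used.
    """
    if abs(r) >= 1.0:
        return 0.0
    df = n - 2
    t = abs(r) * math.sqrt(df / (1.0 - r * r))
    # Simpson's rule for the central mass on [0, t].
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3
    return max(0.0, 1.0 - 2.0 * central)


N_SAMPLES = 173  # assumption: sample count used for WGCNA in the abstract
p_tan = pearson_p_value(-0.1285, N_SAMPLES)
p_turquoise = pearson_p_value(-0.9251, N_SAMPLES)
```

Under this assumption, the coefficient −0.1285 gives a p-value close to the tabulated 0.0920, while −0.9251 gives a value indistinguishable from zero at four decimal places.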
Top 10 genes with the highest intramodular connectivity.
| GENE NAME | logFC | adj.P.Val | GS | MM.Turquoise | K.in |
|---|---|---|---|---|---|
| | 1.3042 | 2.40 × 10⁻³⁹ | 0.8128 | −0.9076 | 933.8071 |
| | 6.3918 | 2.92 × 10⁻⁸⁰ | 0.9417 | −0.9206 | 922.1908 |
| | 1.3517 | 3.34 × 10⁻⁴¹ | 0.8226 | −0.9055 | 918.2304 |
| | 1.9085 | 7.08 × 10⁻⁴⁰ | 0.8118 | −0.8997 | 906.0183 |
| | −6.4444 | 6.12 × 10⁻⁵³ | 0.8723 | 0.9075 | 895.3242 |
| | 2.1435 | 9.16 × 10⁻⁵² | 0.8706 | −0.9122 | 895.1551 |
| | 1.3173 | 1.02 × 10⁻⁴³ | 0.8368 | −0.8970 | 894.6524 |
| | 1.9410 | 2.07 × 10⁻⁴⁶ | 0.8474 | −0.8905 | 890.4061 |
| | −5.3201 | 2.29 × 10⁻⁵⁰ | 0.8628 | 0.8628 | 887.2683 |
| | 1.4126 | 5.21 × 10⁻³⁹ | 0.8085 | −0.8173 | 883.3120 |
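The screening by gene significance (GS), module membership (MM), and intramodular connectivity (K.in) can be sketched as a filter followed by a ranking. The cut-offs of 0.8 used below are hypothetical; the source does not give the thresholds the authors applied, and `screen_hub_genes` is an illustrative helper, not the authors' code. Rows are identified by their position in the table because the gene names are not available here.

```python
# Rows copied from the table above: (logFC, GS, MM.Turquoise, K.in).
ROWS = [
    (1.3042, 0.8128, -0.9076, 933.8071),
    (6.3918, 0.9417, -0.9206, 922.1908),
    (1.3517, 0.8226, -0.9055, 918.2304),
    (1.9085, 0.8118, -0.8997, 906.0183),
    (-6.4444, 0.8723, 0.9075, 895.3242),
    (2.1435, 0.8706, -0.9122, 895.1551),
    (1.3173, 0.8368, -0.8970, 894.6524),
    (1.9410, 0.8474, -0.8905, 890.4061),
    (-5.3201, 0.8628, 0.8628, 887.2683),
    (1.4126, 0.8085, -0.8173, 883.3120),
]


def screen_hub_genes(rows, gs_min=0.8, mm_min=0.8, top_n=10):
    """Keep rows with |GS| and |MM| above the (assumed) cut-offs,
    then rank the survivors by intramodular connectivity, highest first."""
    kept = [(index, row) for index, row in enumerate(rows, start=1)
            if abs(row[1]) >= gs_min and abs(row[2]) >= mm_min]
    kept.sort(key=lambda item: item[1][3], reverse=True)
    return kept[:top_n]


hubs = screen_hub_genes(ROWS)                   # all ten rows pass at 0.8
strict = screen_hub_genes(ROWS, gs_min=0.85)    # a stricter, also hypothetical, GS cut-off
```

Absolute values are used for MM because most of the listed genes are negatively correlated with the turquoise module eigengene while still being strongly connected within the module.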