Dongfang Jia, Cheng Chen, Chen Chen, Fangfang Chen, Ningrui Zhang, Ziwei Yan, Xiaoyi Lv.
Abstract
Understanding the molecular mechanisms of breast cancer (BC) can provide deeper insight into BC pathology. This study reviewed existing technologies for diagnosing BC, such as mammography, ultrasound, magnetic resonance imaging (MRI), computed tomography (CT), and positron emission tomography (PET), and summarized their disadvantages. The purpose of this article is to classify BC samples and normal samples using gene expression profiles from The Cancer Genome Atlas (TCGA) and the Gene Expression Omnibus (GEO). The proposed method overcomes some of the shortcomings of traditional diagnostic methods: it can diagnose BC more rapidly, with high sensitivity and without radiation exposure. The study first selected the genes most relevant to cancer through weighted gene co-expression network analysis (WGCNA) and differential expression analysis (DEA). It then used a protein-protein interaction (PPI) network to screen 23 hub genes. Finally, it used a support vector machine (SVM), a decision tree (DT), a Bayesian network (BN), an artificial neural network (ANN), and two convolutional neural networks (CNN-LeNet and CNN-AlexNet) to classify samples from the expression levels of the 23 hub genes. On these gene expression profiles, the ANN model performed best at classifying cancer samples: its ten-time average accuracy is 97.36% (±0.34%), its F1 value is 0.8535 (±0.0260), its sensitivity is 98.32% (±0.32%), its specificity is 89.59% (±3.53%), and its AUC is 0.99. In summary, this method effectively distinguishes cancer samples from normal samples and offers a reasonable new approach to the early diagnosis of cancer.
Keywords: ANN; PPI; SVM; WGCNA; breast cancer
Year: 2021 PMID: 34079578 PMCID: PMC8165442 DOI: 10.3389/fgene.2021.628136
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. Workflow of this study.
FIGURE 2. Module-trait relationships in the (A) GEO and (B) TCGA datasets. Each row represents a gene module, and each column corresponds to a clinical characteristic of the cancer. Each cell contains the correlation coefficient and P-value for that module and characteristic.
FIGURE 3. Screening of differentially expressed genes (DEGs). In each heat map, columns represent samples, rows represent genes, and each cell shows the expression level of a gene in a sample. In each volcano plot, the horizontal axis represents log|FC|, the vertical axis represents −log10 (adjusted P-value), and each point represents a gene. (A) Heat map of DEGs in the GEO. (B) Volcano plot of DEGs in the GEO. (C) Heat map of DEGs in the TCGA. (D) Volcano plot of DEGs in the TCGA.
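As a rough illustration of how a volcano plot separates DEGs from the remaining genes, the sketch below places each gene at (log fold change, −log10 adjusted P-value) and keeps those passing two cutoffs. The cutoffs `LOGFC_CUTOFF` and `ADJ_P_CUTOFF`, the function names, the placeholder gene labels, and all numeric values are assumptions made for illustration; the thresholds and per-gene statistics actually used in the study are not given here.

```python
import math

# Hypothetical cutoffs; the study's actual thresholds are not stated here.
LOGFC_CUTOFF = 1.0
ADJ_P_CUTOFF = 0.05


def volcano_point(log_fc, adj_p):
    """Return the (x, y) position of a gene on a volcano plot."""
    return log_fc, -math.log10(adj_p)


def is_deg(log_fc, adj_p):
    """A gene counts as differentially expressed only if both cutoffs are met."""
    return abs(log_fc) >= LOGFC_CUTOFF and adj_p < ADJ_P_CUTOFF


# Made-up statistics: placeholder gene -> (log fold change, adjusted P-value).
stats = {
    "GENE_A": (-2.3, 1e-8),   # strongly down-regulated, significant
    "GENE_B": (0.2, 0.60),    # small change, not significant
    "GENE_C": (1.4, 0.20),    # large change, not significant
    "GENE_D": (1.7, 0.001),   # up-regulated, significant
}

degs = sorted(gene for gene, (fc, p) in stats.items() if is_deg(fc, p))
print(degs)
```

Requiring both a fold-change cutoff and a significance cutoff is what gives the volcano plot its two flagged "wings"; genes that pass only one criterion stay in the unflagged region.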
FIGURE 4. Gene Venn diagrams between the two groups of DEGs and the two groups of co-expressed genes.
FIGURE 5. Protein–protein interaction (PPI) network. Each node represents a protein, and each edge represents an interaction between two proteins.
Maximal clique centrality and degree of hub genes.
| Node name | MCC | Degree | Node name | MCC | Degree |
| GNG11 | 134 | 9 | PPARG | 34 | 13 |
| ANXA1 | 132 | 7 | CEBPA | 18 | 6 |
| GNAI1 | 127 | 7 | FABP4 | 18 | 8 |
| IGF1 | 123 | 8 | JUN | 17 | 10 |
| VWF | 121 | 6 | ADIPOQ | 14 | 5 |
| A2M | 121 | 6 | EDNRB | 12 | 4 |
| ACKR3 | 120 | 5 | TF | 12 | 6 |
| P2RY14 | 120 | 5 | IL6 | 12 | 9 |
| S1PR1 | 120 | 5 | FOS | 12 | 7 |
| CFD | 120 | 5 | LPL | 11 | 6 |
| CLU | 120 | 5 | LEP | 10 | 6 |
| SERPING1 | 120 | 5 | | | |
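To make the two columns of this table concrete, the following sketch computes node degree and maximal clique centrality (MCC) on a toy graph. It assumes the definition commonly associated with the cytoHubba plugin, MCC(v) = Σ (|C| − 1)! over all maximal cliques C containing v; the tool and the PPI edge list used in the study are not given here, so the edge list, node names, and function names below are hypothetical.

```python
from math import factorial


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)  # r cannot be extended any further
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def degree_and_mcc(edges):
    """Return (degree, mcc) dictionaries keyed by node name."""
    adj = build_adjacency(edges)
    mcc = {v: 0 for v in adj}
    for clique in maximal_cliques(adj):
        for v in clique:
            mcc[v] += factorial(len(clique) - 1)
    degree = {v: len(neighbours) for v, neighbours in adj.items()}
    return degree, mcc


# Toy network: a triangle A-B-C plus a pendant edge C-D.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
degree, mcc = degree_and_mcc(edges)
print(degree)
print(mcc)
```

Under this definition, a node whose neighbours share no edges scores exactly its degree, while membership in larger cliques raises the score factorially. That would be consistent with the table, where several genes with modest degree still have a high MCC.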
Pathway enrichment analysis of hub genes.
| KEGG pathway ID and term | Count | P-value | Genes |
| hsa05200: Pathways in cancer | 9 | 7.25 × 10⁻⁶ | CEBPA, IL6, JUN, EDNRB, PPARG, FOS, IGF1, GNG11, GNAI1 |
| hsa05133: Pertussis | 5 | 5.53 × 10⁻⁵ | IL6, JUN, SERPING1, FOS, GNAI1 |
| hsa04932: Non-alcoholic fatty liver disease (NAFLD) | 5 | 8.22 × 10⁻⁴ | CEBPA, IL6, JUN, LEP, ADIPOQ |
| hsa03320: PPAR signaling pathway | 4 | 8.94 × 10⁻⁴ | FABP4, ADIPOQ, LPL, PPARG |
| hsa04610: Complement and coagulation cascades | 4 | 9.74 × 10⁻⁴ | CFD, VWF, SERPING1, A2M |
| hsa05142: Chagas disease (American trypanosomiasis) | 4 | 3.17 × 10⁻³ | IL6, JUN, FOS, GNAI1 |
| hsa04152: AMPK signaling pathway | 4 | 5.09 × 10⁻³ | LEP, ADIPOQ, PPARG, IGF1 |
| hsa05202: Transcriptional misregulation in cancer | 4 | 1.18 × 10⁻² | CEBPA, IL6, PPARG, IGF1 |
| hsa05132: Salmonella infection | 3 | 2.37 × 10⁻² | IL6, JUN, FOS |
| hsa05323: Rheumatoid arthritis | 3 | 2.65 × 10⁻² | IL6, JUN, FOS |
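Enrichment significance of this kind is often computed with a hypergeometric (one-sided Fisher) test. The sketch below shows that calculation as one plausible approach. The statistical test and the background gene set used in the study are not stated here, so the function name and the toy numbers are assumptions, and the code is not expected to reproduce the values in the table.

```python
from math import comb


def hypergeom_tail(population, pathway_size, sample_size, overlap):
    """P(X >= overlap) when sample_size genes are drawn without replacement
    from a population in which pathway_size genes belong to the pathway."""
    upper = min(pathway_size, sample_size)
    total = comb(population, sample_size)
    favourable = sum(
        comb(pathway_size, k) * comb(population - pathway_size, sample_size - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total


# Toy numbers only: 10 genes in total, 4 annotated to a pathway,
# 3 genes selected, at least 2 of them in the pathway.
p_toy = hypergeom_tail(population=10, pathway_size=4, sample_size=3, overlap=2)
print(round(p_toy, 4))
```

In a real analysis, `sample_size` would be the number of hub genes, `overlap` the "Count" column, and `population` and `pathway_size` would come from the annotation database, which is why the result depends strongly on the chosen background.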
Accuracy results of each model.
| Model | First (%) | Second (%) | Third (%) | Average (SD) |
| SVM | 97.28 | 96.73 | 96.46 | 96.82% (±0.34%) |
| ANN | 97.82 | 97.00 | 97.27 | 97.36% (±0.34%) |
| CNN (LeNet) | 91.01 | 89.65 | 90.46 | 90.37% (±0.56%) |
| CNN (AlexNet) | 91.82 | 90.46 | 91.55 | 91.27% (±0.59%) |
| BN | 93 | 93 | 93 | 93% (0) |
| DT | 95.6 | 95.3 | 94.8 | 95.23% (±0.33%) |
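The "Average (SD)" column can be checked directly from the three listed runs. The sketch below assumes that the deviation is a population standard deviation (`statistics.pstdev`); this is an inference from the numbers, not something stated in the source. With that assumption the SVM, ANN, and DT rows are reproduced exactly, while the CNN (AlexNet) mean comes out as 91.28 rather than 91.27, presumably because the table was computed from unrounded accuracies.

```python
from statistics import mean, pstdev

# Accuracies (%) copied from the three runs listed in the table.
runs = {
    "SVM": [97.28, 96.73, 96.46],
    "ANN": [97.82, 97.00, 97.27],
    "CNN (LeNet)": [91.01, 89.65, 90.46],
    "CNN (AlexNet)": [91.82, 90.46, 91.55],
    "BN": [93, 93, 93],
    "DT": [95.6, 95.3, 94.8],
}

# Mean and population standard deviation, rounded to two decimals.
summary = {
    model: (round(mean(values), 2), round(pstdev(values), 2))
    for model, values in runs.items()
}

for model, (avg, sd) in summary.items():
    print(f"{model}: {avg:.2f}% (±{sd:.2f}%)")
```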
Model metrics.
| Model | F1 (SD) | Sensitivity (SD) | Specificity (SD) |
| SVM | 0.8176 (±0.0477) | 97.69% (±0.88%) | 83.80% (±4.64%) |
| ANN | 0.8535 (±0.0260) | 98.32% (±0.32%) | 89.59% (±3.53%) |
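For reference, the three metrics in this table follow from a binary confusion matrix using the standard definitions sketched below. The counts in the example are hypothetical and are not taken from the study. Which class the study treated as positive when computing F1 is not stated here, so the labelling is an assumption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, and F1 from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1}


# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=90, fp=5, tn=45, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because F1 depends on precision, it is sensitive to class imbalance and to the choice of positive class, so an F1 value can be noticeably lower than both sensitivity and specificity, as it is in the table.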
FIGURE 6. The ROC and AUC of ANN and SVM.
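The AUC summarized in such a figure equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison, which is a minimal illustration rather than the study's procedure; the labels and scores are made up, and the function name is hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties between a positive and a negative score count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = positive class) and classifier scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

The pairwise form is quadratic in the number of samples, so it is suited to small illustrations; a rank-based formulation gives the same value more efficiently on large datasets.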