Abstract
Genomic copy number variations (CNVs) are among the most important structural variations of genes. They have been found to be related to individual cancer risk and can therefore provide clues for research on the formation and progression of cancer. In this paper, an improved computational gene selection algorithm called CRIA (correlation-redundancy and interaction analysis based gene selection algorithm) is introduced to screen genes that are closely related to cancer from the whole genome based on the values of gene CNVs. The CRIA algorithm consists of two main parts. First, the main effect feature, that is, the feature with the largest correlation with the class label, is selected from the original feature set. Second, in each selection round, the correlation, redundancy and interaction of every feature in the candidate feature set are analyzed; the feature that maximizes a custom selection criterion is added to the selected feature set and removed from the candidate feature set. On real datasets, CRIA selects the top 200 genes to predict the type of cancer. The experimental results show that, compared with state-of-the-art related methods, CRIA extracts the key features of CNVs and achieves better classification performance based on them. In addition, the selected genes are interpretable and highly related to cancer, which may provide new clues at the genetic level for cancer treatment.
Keywords: cancers prediction; copula entropy; copy number variations (CNVs); correlation-redundancy analysis; gene selection; interaction analysis
Year: 2022 PMID: 35386679 PMCID: PMC8978562 DOI: 10.3389/fpls.2022.839044
Source DB: PubMed Journal: Front Plant Sci ISSN: 1664-462X Impact factor: 5.753
The number of samples for each cancer type in this dataset.
| No. | Cancer type | Number of samples | Percentage |
|---|---|---|---|
| 1 | UCEC (Uterine corpus endometrial carcinoma) | 443 | 12.73% |
| 2 | KIRC (Kidney renal clear cell carcinoma) | 490 | 14.08% |
| 3 | OV (Ovarian serous cystadenocarcinoma) | 562 | 16.15% |
| 4 | GBM (Glioblastoma multiforme) | 563 | 16.18% |
| 5 | COAD/READ (Colon adenocarcinoma/Rectum adenocarcinoma) | 575 | 16.52% |
| 6 | BRCA (Breast invasive carcinoma) | 847 | 24.34% |
| Total | | 3,480 | 100% |
Figure 1. General flow chart of the proposed algorithm.
CRIA: correlation-redundancy and interaction analysis based gene selection algorithm.
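| 1 initialize the selected feature set Ω |
| 2 for each feature in the original feature set |
| 3 calculate its correlation with the class label |
| 4 end for |
| 5 select the feature with the largest correlation with the class label (the main effect feature) |
| 6 add it to Ω |
| 7 remove it from the candidate feature set |
| 8 while the required number of features has not been selected |
| 9 for each feature in the candidate feature set |
| 10 calculate its correlation |
| 11 calculate its redundancy |
| 12 calculate its interaction |
| 13 end for |
| 14 select the feature that maximizes the selection criterion |
| 15 add it to Ω |
| 16 remove it from the candidate feature set |
| 17 end while |
| 18 output Ω |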
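The two-stage greedy procedure described in the abstract and outlined in the listing above can be illustrated with a small Python sketch. It is not the authors' implementation: the keywords indicate that CRIA relies on copula entropy, whereas this sketch substitutes plain empirical mutual information on discrete values, and the scoring rule (relevance minus mean redundancy plus mean interaction gain) is an assumed stand-in for the paper's selection criterion. The names `mutual_info`, `greedy_select` and the toy data are hypothetical.

```python
from collections import Counter
from math import log


def mutual_info(x, y):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log(c * n / (px[a] * py[b])) for (a, b), c in pxy.items())


def greedy_select(features, labels, k):
    """Greedy selection of k features.

    features: dict mapping feature name -> list of discrete values
    labels:   list of class labels, same length as each feature
    The score below is an assumed stand-in, not the paper's criterion.
    """
    candidates = dict(features)
    relevance = {f: mutual_info(v, labels) for f, v in candidates.items()}

    # Stage 1: pick the feature most strongly associated with the label.
    first = max(relevance, key=relevance.get)
    selected = [first]
    del candidates[first]

    # Stage 2: repeatedly add the candidate with the best combined score.
    while len(selected) < k and candidates:
        def score(f):
            x = candidates[f]
            redundancy = interaction = 0.0
            for s in selected:
                xs = features[s]
                redundancy += mutual_info(x, xs)
                # Interaction gain: information the pair carries beyond the parts.
                pair = list(zip(x, xs))
                interaction += mutual_info(pair, labels) - relevance[f] - relevance[s]
            m = len(selected)
            return relevance[f] - redundancy / m + interaction / m

        best = max(candidates, key=score)
        selected.append(best)
        del candidates[best]
    return selected


# Toy data (hypothetical): f1_dup duplicates f1, g complements f1.
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_features = {
    "f1": [0, 0, 0, 1, 1, 1, 1, 1],
    "f1_dup": [0, 0, 0, 1, 1, 1, 1, 1],
    "g": [0, 1, 0, 1, 0, 0, 0, 0],
}
picked = greedy_select(toy_features, toy_labels, 2)
print(picked)
```

In the toy data the duplicated feature is penalized by both the redundancy and the interaction terms, so the complementary feature `g` is preferred in the second round even though its individual relevance is lower.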
Datasets for comparison between CRIA algorithm and other algorithms.
| Category | No. | Dataset | Samples | Features | Classes | Data type |
|---|---|---|---|---|---|---|
| Biological data | 1 | leukemia | 72 | 7,070 | 2 | Discrete |
| | 2 | Carcinoma | 174 | 9,182 | 11 | Continuous |
| | 3 | colon | 62 | 2,000 | 2 | Discrete |
| | 4 | TOX_171 | 171 | 5,748 | 4 | Continuous |
| Digit recognition | 5 | Gisette | 7,000 | 5,000 | 2 | Continuous |
Comparison (mean ± std. dev.) of performance between CRIA and eight other algorithms with the J48 classifier.
| Dataset | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| leukemia | 93.08 ± 0.67 | 93.09 ± 0.64 | 92.51 ± 1.14 | 92.05 ± 0.73 | 93.36 ± 0.59 | 92.70 ± 0.98 | 93.07 ± 1.10 | 91.50 ± 1.44 | |
| Carcinoma | 71.73 ± 1.80 | 71.78 ± 2.26 | 69.79 ± 1.58 | 64.47 ± 1.82 | 64.84 ± 1.68 | 68.65 ± 2.20 | 68.01 ± 1.97 | 65.53 ± 1.89 | |
| colon | 76.55 ± 3.86 | 77.14 ± 3.31 | 74.32 ± 2.48 | 76.32 ± 2.07 | 77.28 ± 3.77 | 77.30 ± 4.73 | 74.71 ± 3.96 | 73.31 ± 3.77 | |
| TOX_171 | 62.01 ± 1.61 | 62.21 ± 2.04 | 56.61 ± 2.21 | 60.94 ± 3.08 | 62.09 ± 2.14 | 57.49 ± 3.01 | 59.78 ± 2.02 | 60.18 ± 1.91 | |
| gisette | 92.40 ± 0.08 | 92.02 ± 0.08 | 92.66 ± 0.10 | 92.05 ± 0.12 | 91.18 ± 0.12 | 92.84 ± 0.08 | 92.82 ± 0.07 | 93.35 ± 0.08 | |
| Avg.acc | 81.11 | 79.19 | 78.53 | 78.04 | 77.40 | 77.63 | 78.25 | 77.76 | 77.19 |
| Avg.rank | 1.60 | 4.00 | 5.00 | 5.80 | 6.60 | 5.60 | 4.80 | 5.80 | 5.80 |
| Improved | – | 2.42% | 3.29% | 3.93% | 4.79% | 4.48% | 3.65% | 4.31% | 5.08% |
Bold values indicate the best performance achieved on each dataset among the nine methods.
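The values in the Improved row are numerically consistent with the relative gain of the first column's average accuracy over each of the other columns, for example (81.11 − 79.19) / 79.19 ≈ 2.42%. The source does not state this formula, so the sketch below is an inference from the reported numbers; the function name `relative_gain` is hypothetical.

```python
def relative_gain(reference, other):
    """Relative improvement of `reference` over `other`, in percent."""
    return 100.0 * (reference - other) / other


# Avg.acc row of the J48 table; the first entry is taken as the reference.
avg_acc = [81.11, 79.19, 78.53, 78.04, 77.40, 77.63, 78.25, 77.76, 77.19]
gains = [round(relative_gain(avg_acc[0], a), 2) for a in avg_acc[1:]]
print(gains)
```

The same calculation reproduces the first Improved entries of the Naïve Bayes and IB1 tables (2.11% and 3.73%).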
Comparison (mean ± std. dev.) of performance between CRIA and eight other algorithms with the Naïve Bayes classifier.
| Dataset | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| leukemia | 97.44 ± 0.66 | 96.27 ± 0.30 | 97.03 ± 0.70 | 95.48 ± 1.96 | 96.18 ± 0.39 | 97.15 ± 0.80 | 96.79 ± 0.71 | 96.70 ± 0.58 | |
| Carcinoma | 81.61 ± 1.11 | 80.23 ± 1.33 | 81.58 ± 0.95 | 76.38 ± 1.85 | 80.19 ± 1.28 | 80.41 ± 0.86 | 80.34 ± 0.87 | 80.46 ± 1.09 | |
| colon | 82.97 ± 1.34 | 82.70 ± 1.20 | 80.72 ± 1.47 | 74.39 ± 4.37 | 81.66 ± 1.45 | 79.84 ± 2.40 | 78.95 ± 1.72 | 82.32 ± 2.08 | |
| TOX_171 | 69.53 ± 0.70 | 63.73 ± 1.61 | 68.68 ± 1.26 | 65.64 ± 1.63 | 60.41 ± 2.35 | 66.96 ± 1.56 | 67.04 ± 1.74 | 70.28 ± 1.60 | |
| gisette | 90.46 ± 0.13 | 88.26 ± 0.02 | 87.69 ± 0.08 | 86.23 ± 0.24 | 86.01 ± 0.05 | 87.60 ± 0.04 | 87.48 ± 0.03 | 89.46 ± 0.05 | |
| Avg.acc | 86.52 | 84.73 | 82.24 | 83.14 | 79.62 | 80.89 | 82.39 | 82.12 | 83.84 |
| Avg.rank | 1.60 | 1.80 | 5.80 | 4.40 | 8.40 | 7.80 | 5.40 | 6.20 | 3.80 |
| Improved | – | 2.11% | 5.20% | 4.07% | 8.67% | 6.96% | 5.01% | 5.36% | 3.20% |
Bold values indicate the best performance achieved on each dataset among the nine methods.
Comparison (mean ± std. dev.) of performance between CRIA and eight other algorithms with the IB1 classifier.
| Dataset | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| leukemia | 97.03 ± 0.87 | 96.16 ± 0.43 | 95.56 ± 0.90 | 88.75 ± 1.88 | 96.61 ± 0.78 | 96.13 ± 0.90 | 95.73 ± 0.95 | 94.22 ± 0.80 | |
| Carcinoma | 82.45 ± 1.33 | 81.39 ± 1.02 | 82.32 ± 1.07 | 76.88 ± 1.97 | 81.10 ± 1.06 | 81.39 ± 1.16 | 81.35 ± 1.08 | 80.96 ± 1.18 | |
| colon | 78.60 ± 2.10 | 78.22 ± 1.54 | 75.94 ± 2.32 | 70.87 ± 1.97 | 76.69 ± 2.35 | 71.77 ± 3.41 | 72.24 ± 2.17 | 74.87 ± 1.80 | |
| TOX_171 | 84.56 ± 0.52 | 85.13 ± 1.26 | 78.14 ± 1.36 | 82.59 ± 1.78 | 76.68 ± 1.68 | 81.69 ± 1.64 | 82.05 ± 1.28 | 84.04 ± 1.48 | |
| gisette | 91.88 ± 0.08 | 91.26 ± 0.09 | 92.26 ± 0.07 | 91.05 ± 0.15 | 90.20 ± 0.10 | 92.70 ± 0.06 | 92.58 ± 0.05 | 93.13 ± 0.10 | |
| Avg.acc | 90.27 | 87.02 | 85.03 | 86.25 | 82.03 | 84.26 | 84.74 | 84.79 | 85.44 |
| Avg.rank | 1.40 | 2.80 | 5.30 | 4.20 | 8.00 | 6.40 | 5.50 | 5.80 | 5.60 |
| Improved | – | 3.73% | 6.16% | 4.66% | 10.05% | 7.13% | 6.53% | 6.46% | 5.65% |
Bold values indicate the best performance achieved on each dataset among the nine methods.
The top 15 feature genes chosen by CRIA defined as equation (26).
| No. | Gene symbol | Gene name | Gene type | |
|---|---|---|---|---|
| 1 | RPS15 | Ribosomal Protein S15 | Protein Coding | 0.168 |
| 2 | TBC1D5 | TBC1 Domain Family Member 5 | Protein Coding | 0.089 |
| 3 | CUL2 | Cullin 2 | Protein Coding | 0.093 |
| 4 | SMPD3 | Sphingomyelin Phosphodiesterase 3 | Protein Coding | 0.089 |
| 5 | CTAGE10P | CTAGE Family Member 10, Pseudogene | Pseudogene | 0.071 |
| 6 | C1orf98 | Chromosome 1 Open Reading Frame 98 | Protein Coding | 0.043 |
| 7 | ZNF281 | Zinc Finger Protein 281 | Protein Coding | 0.061 |
| 8 | CDKN2A | Cyclin Dependent Kinase Inhibitor 2A | Protein Coding | 0.161 |
| 9 | EGFR | Epidermal Growth Factor Receptor | Protein Coding | 0.121 |
| 10 | TMEM98 | Transmembrane Protein 98 | Protein Coding | 0.103 |
| 11 | CTBP2 | C-Terminal Binding Protein 2 | Protein Coding | 0.083 |
| 12 | SEMA6A | Semaphorin 6A | Protein Coding | 0.081 |
| 13 | MIR1208 | MicroRNA 1208 | RNA Gene | 0.077 |
| 14 | RBFOX1 | RNA Binding Fox-1 Homolog 1 | Protein Coding | 0.069 |
| 15 | CDC25A | Cell Division Cycle 25A | Protein Coding | 0.066 |
Figure 2. Classification accuracies of three classifiers (CatBoost, LightGBM and SVM) with different numbers of features during the incremental feature selection (IFS) procedure. The top 200 feature genes are selected by the CRIA method.
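The IFS procedure behind Figure 2 evaluates nested subsets of the ranked genes (top 1, top 2, and so on) and records the accuracy for each subset size. A minimal sketch of that idea follows. It is only an illustration: to stay within the Python standard library it substitutes a leave-one-out 1-nearest-neighbour classifier for CatBoost, LightGBM and SVM, and the function names and toy data are hypothetical.

```python
def loo_1nn_accuracy(rows, labels, cols):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier using only `cols`."""
    correct = 0
    for i, xi in enumerate(rows):
        best_j, best_d = None, float("inf")
        for j, xj in enumerate(rows):
            if i == j:
                continue
            # Squared Euclidean distance restricted to the chosen columns.
            d = sum((xi[c] - xj[c]) ** 2 for c in cols)
            if d < best_d:
                best_j, best_d = j, d
        correct += labels[best_j] == labels[i]
    return correct / len(rows)


def incremental_feature_selection(rows, labels, ranked_cols):
    """Return the accuracy curve over nested top-k subsets and the best k."""
    curve = [
        loo_1nn_accuracy(rows, labels, ranked_cols[:k])
        for k in range(1, len(ranked_cols) + 1)
    ]
    # The first k that reaches the highest accuracy on the curve.
    best_k = max(range(len(curve)), key=curve.__getitem__) + 1
    return curve, best_k


# Toy data (hypothetical): column 0 is informative, column 1 complements it,
# column 2 is large-scale noise that hurts a distance-based classifier.
toy_rows = [
    [0.0, 0.0, 5.0],
    [0.2, 0.1, -5.0],
    [0.6, 0.0, 0.0],
    [0.5, 1.0, 5.1],
    [0.9, 1.0, -5.1],
    [1.1, 1.0, 0.1],
]
toy_labels = [0, 0, 0, 1, 1, 1]
curve, best_k = incremental_feature_selection(toy_rows, toy_labels, [0, 1, 2])
print(curve, best_k)
```

On the toy data the accuracy rises when the second feature is added and collapses when the noisy third feature is included, which is the kind of peak an IFS curve is used to locate.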
Average performance of precision, recall and F1-score on 10 test datasets with three classifiers via ten-fold cross-validation (%).
| Metric | Method | UCEC | KIRC | OV | GBM | COADREAD | BRCA |
|---|---|---|---|---|---|---|---|
| Precision | CRIA_CatBoost | 74.31 | 93.74 | 84.59 | 94.63 | 89.67 | 84.48 |
| | CRIA_SVM | 70.47 | 90.33 | 85.40 | 95.64 | 88.54 | 84.76 |
| | CRIA_LightGBM | 71.46 | 93.37 | 82.84 | 95.72 | 90.23 | 84.71 |
| Recall | CRIA_CatBoost | 73.14 | 91.63 | 87.90 | 90.76 | 86.09 | 88.67 |
| | CRIA_SVM | 73.81 | 89.59 | 88.43 | 89.70 | 83.30 | 87.96 |
| | CRIA_LightGBM | 72.91 | 92.04 | 89.32 | 91.30 | 83.48 | 87.01 |
| F1-score | CRIA_CatBoost | 73.72 | 92.67 | 86.21 | 92.65 | 87.84 | 86.52 |
| | CRIA_SVM | 72.13 | 89.96 | 86.89 | 92.57 | 85.84 | 86.33 |
| | CRIA_LightGBM | 72.18 | 92.70 | 85.96 | 93.46 | 86.72 | 85.84 |
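The F1-score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes it from the tabulated precision and recall; the function name is hypothetical. Because the table reports averages over cross-validation folds, an F1 recomputed from averaged precision and recall is expected to match the tabulated value only approximately.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# CRIA_CatBoost, first value column: precision 74.31, recall 73.14.
catboost_f1 = f1_score(74.31, 73.14)
# CRIA_SVM, first value column: precision 70.47, recall 73.81.
svm_f1 = f1_score(70.47, 73.81)
print(round(catboost_f1, 2), round(svm_f1, 2))
```

For CRIA_CatBoost the recomputed value agrees with the tabulated 73.72, while for CRIA_SVM it differs from the tabulated 72.13 by a few hundredths, as expected for fold-averaged metrics.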
Performance comparison of the proposed algorithm predictions with those of other methods (%).
| Cancer type | Method | Precision | Recall | F1-score |
|---|---|---|---|---|
| UCEC | CRIA_CatBoost | 74.31 | 73.14 | 73.72 |
| | CRIA_SVM | 70.47 | 73.81 | 72.13 |
| | CRIA_LightGBM | 71.46 | 72.91 | 72.18 |
| | CNA_origin | 67.92 | 72.00 | 69.90 |
| | mRMR_Dagging | 74.19 | 46.93 | 57.50 |
| KIRC | CRIA_CatBoost | 93.74 | 91.63 | 92.67 |
| | CRIA_SVM | 90.33 | 89.59 | 89.96 |
| | CRIA_LightGBM | 93.37 | 92.04 | 92.70 |
| | CNA_origin | 88.89 | | 92.31 |
| | mRMR_Dagging | 80.85 | 92.68 | 86.36 |
| OV | CRIA_CatBoost | 84.59 | 87.90 | 86.21 |
| | CRIA_SVM | 85.40 | 88.43 | 86.89 |
| | CRIA_LightGBM | 82.84 | 89.32 | 85.96 |
| | CNA_origin | | 86.72 | |
| | mRMR_Dagging | 84.61 | 75.86 | 80.00 |
| GBM | CRIA_CatBoost | 94.63 | 90.76 | 92.65 |
| | CRIA_SVM | 95.64 | 89.70 | 92.57 |
| | CRIA_LightGBM | 95.72 | 91.30 | 93.46 |
| | CNA_origin | 93.10 | 84.38 | 88.52 |
| | mRMR_Dagging | 88.70 | 85.93 | 87.30 |
| COADREAD | CRIA_CatBoost | 89.67 | 86.09 | 87.84 |
| | CRIA_SVM | 88.54 | 83.30 | 85.84 |
| | CRIA_LightGBM | 90.23 | 83.48 | 86.72 |
| | CNA_origin | 81.58 | 73.81 | 77.50 |
| | mRMR_Dagging | 60.00 | 73.46 | 66.05 |
| BRCA | CRIA_CatBoost | 84.48 | 88.67 | 86.52 |
| | CRIA_SVM | 84.76 | 87.96 | 86.33 |
| | CRIA_LightGBM | 84.71 | 87.01 | 85.84 |
| | CNA_origin | | | |
| | mRMR_Dagging | 79.16 | 87.35 | 83.06 |
Figure 3. Performance comparison on four evaluation metrics (accuracy, precision, recall and F1-score) between our methods and two other algorithms (CNA_origin and mRMR_Dagging).
Figure 4. Confusion matrices on the test groups: (A) CRIA_CatBoost; (B) CRIA_SVM; (C) CRIA_LightGBM.