Yu-Hang Zhang, Zhandong Li, Tao Zeng, Xiaoyong Pan, Lei Chen, Dejing Liu, Hao Li, Tao Huang, Yu-Dong Cai.
Abstract
Glioblastoma, also called glioblastoma multiforme (GBM), is the most aggressive cancer that initiates within the brain. GBM arises in the central nervous system, and its cancer cells resemble stem cells. Several schemes for GBM stratification exist, based on intertumoral molecular heterogeneity, preoperative images, and integrated tumor characteristics. Although the formation of glioblastoma is closely related to gene methylation, GBM has been poorly classified by epigenetics. To classify glioblastoma subtypes on the basis of gene methylation levels, we adopted several machine learning algorithms to identify methylation features (sites) associated with the classification of GBM. The features were first analyzed with the Monte Carlo feature selection (MCFS) method, which produced a ranked feature list. This list was then fed into incremental feature selection (IFS), which incorporated a classification algorithm, to extract essential sites. These sites can be annotated onto coding genes, such as CXCR4, TBX18, SP5, and TMEM22, and are enriched in biological functions relevant to GBM classification (e.g., subtype-specific functions). Representative functions, such as nervous system development, intrinsic plasma membrane component, calcium ion binding, systemic lupus erythematosus, and alcoholism, are potential pathogenic functions that participate in the initiation and progression of glioblastoma and its subtypes. With these sites, an efficient model can be built to classify the subtypes of glioblastoma.
Keywords: classification; glioblastoma; methylation; signature; subtype
Year: 2020 PMID: 33329750 PMCID: PMC7732602 DOI: 10.3389/fgene.2020.604336
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Breakdown of the GBM samples in the training and independent datasets.
| Category | Training dataset | Independent dataset |
|---|---|---|
| G34 | 41 | 13 |
| MES | 56 | 104 |
| MID | 14 | 19 |
| MYCN | 16 | 17 |
| RTK | 64 | 44 |
| RTK II | 143 | 118 |
| RTK III | 13 | 9 |
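The counts in this table are uneven across subtypes, which is worth quantifying before reading the accuracy figures further down. The short sketch below only sums and normalizes the numbers from the table; the names `counts` and `class_shares` are illustrative and not from the paper.

```python
# Sample counts copied from the table above: (training, independent).
counts = {
    "G34": (41, 13),
    "MES": (56, 104),
    "MID": (14, 19),
    "MYCN": (16, 17),
    "RTK": (64, 44),
    "RTK II": (143, 118),
    "RTK III": (13, 9),
}

# Column totals for the two datasets.
train_total = sum(train for train, _ in counts.values())
indep_total = sum(indep for _, indep in counts.values())


def class_shares(column):
    """Return each category's fraction of one column (0 = training, 1 = independent)."""
    total = sum(row[column] for row in counts.values())
    return {name: row[column] / total for name, row in counts.items()}


train_shares = class_shares(0)
largest = max(train_shares, key=train_shares.get)
smallest = min(train_shares, key=train_shares.get)

print("training samples:", train_total)
print("independent samples:", indep_total)
print("largest training class:", largest, round(train_shares[largest], 3))
print("smallest training class:", smallest, round(train_shares[smallest], 3))
```

The training set sums to 347 samples and the independent set to 324. RTK II is the largest training class and RTK III the smallest, so a metric that accounts for class imbalance, such as MCC, is a reasonable companion to overall accuracy.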
FIGURE 1. Flowchart of the analysis performed in this study. The training dataset is first analyzed by the Monte Carlo feature selection (MCFS) method. Features are ranked in a list, which is fed into the incremental feature selection (IFS) with one of three classification algorithms. The optimal classifiers based on the different classification algorithms are built, and their performance is further evaluated on a test dataset.
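The IFS step in this flowchart can be sketched as a loop over nested top-k subsets of the ranked feature list. This is a minimal illustration, not the authors' implementation: the function name `incremental_feature_selection`, the `evaluate` callback, the `step` parameter, and the toy feature names are all assumptions made for the example. In practice `evaluate` would train one of the classifiers and return a cross-validated score such as MCC.

```python
from typing import Callable, List, Sequence, Tuple


def incremental_feature_selection(
    ranked_features: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
    step: int = 10,
) -> Tuple[List[str], float, List[Tuple[int, float]]]:
    """Score nested top-k feature subsets and return the best one.

    ranked_features: features sorted from most to least important
                     (for example, a list produced by MCFS).
    evaluate:        hypothetical callback that trains a classifier on the
                     given features and returns a score to maximize.
    step:            number of features added between evaluations (assumed).
                     Trailing features that do not fill a whole step are
                     not evaluated.
    """
    if step < 1:
        raise ValueError("step must be at least 1")
    curve = []  # (k, score) pairs, i.e. the points of the IFS curve
    best_k, best_score = 0, float("-inf")
    for k in range(step, len(ranked_features) + 1, step):
        score = evaluate(ranked_features[:k])
        curve.append((k, score))
        if score > best_score:  # strict '>' keeps the smallest k on ties
            best_k, best_score = k, score
    return list(ranked_features[:best_k]), best_score, curve


# Toy usage: a made-up scorer that rewards three "informative" features and
# slightly penalizes every feature used. It stands in for real
# cross-validation and carries no biological meaning.
ranked = ["cg_a", "cg_b", "cg_c", "cg_d", "cg_e", "cg_f"]
informative = {"cg_a", "cg_b", "cg_c"}


def toy_score(features):
    hits = len(informative.intersection(features))
    return hits / len(informative) - 0.01 * len(features)


selected, score, curve = incremental_feature_selection(ranked, toy_score, step=1)
print(selected)
print(curve)
```

Plotting the `curve` pairs gives the kind of IFS curve shown in Figure 2, and the peak of that curve identifies the feature subset used for the optimal classifier.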
FIGURE 2. IFS curves with support vector machine, random forest, and RIPPER on the training set. The support vector machine yields the highest MCC (0.939) when the top 4100 features are used, while the highest MCCs of random forest and RIPPER are 0.882 and 0.737, obtained with the top 1690 and 1180 features, respectively.
10-fold cross-validation performance of the optimal SVM, RF, and RIPPER classifiers on the training set.
| Classification algorithm | Number of features | Overall accuracy | MCC |
|---|---|---|---|
| SVM | 4100 | 0.954 | 0.939 |
| RF | 1690 | 0.911 | 0.882 |
| RIPPER | 1180 | 0.804 | 0.737 |
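The two metrics reported in this table can both be computed from a confusion matrix. The sketch below assumes the common multiclass generalization of the Matthews correlation coefficient (Gorodkin's R_K); the source text does not state which formulation was used, and the confusion matrix is a made-up three-class example, not data from the study.

```python
from math import sqrt


def overall_accuracy(confusion):
    """Fraction of samples on the diagonal (rows = true class, columns = predicted)."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def multiclass_mcc(confusion):
    """Multiclass MCC (Gorodkin's R_K) from a square confusion matrix."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(k))
    true_counts = [sum(confusion[i]) for i in range(k)]  # row sums
    pred_counts = [sum(confusion[i][j] for i in range(k)) for j in range(k)]  # column sums
    numerator = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    denominator = sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts))
    )
    # By convention, return 0 when the denominator vanishes
    # (for example, when every sample is predicted as one class).
    return numerator / denominator if denominator else 0.0


# Hypothetical confusion matrix for three classes of five samples each.
toy_confusion = [
    [5, 0, 0],
    [0, 4, 1],
    [0, 0, 5],
]
acc = overall_accuracy(toy_confusion)
mcc = multiclass_mcc(toy_confusion)
print(round(acc, 3), round(mcc, 3))
```

Unlike overall accuracy, MCC uses the row and column totals of the confusion matrix, so it is less easily inflated by a classifier that favors the largest classes.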
FIGURE 3. Performance of the optimal SVM, RF, and RIPPER classifiers on different categories in the training dataset. The optimal SVM and RF classifiers are far superior to the optimal RIPPER classifier, and the optimal SVM classifier is slightly superior to the optimal RF classifier.
Performance of the optimal SVM, RF, and RIPPER classifiers on the independent test dataset.
| Classification algorithm | Overall accuracy | MCC |
|---|---|---|
| SVM | 0.852 | 0.798 |
| RF | 0.877 | 0.832 |
| RIPPER | 0.954 | 0.937 |
FIGURE 4. Performance of the optimal SVM, RF, and RIPPER classifiers on different categories in the independent test dataset. The optimal RIPPER classifier generalizes best to the independent test dataset, followed by the optimal RF and SVM classifiers.