Yajie Meng, Min Jin.
Abstract
The emergence of high-throughput RNA-seq data has offered unprecedented opportunities for cancer diagnosis. However, most existing approaches to cancer diagnosis struggle to capture the highly nonlinear and complex associations in biological data. In this study, we propose a novel hierarchical feature selection and second learning probability error ensemble model (named HFS-SLPEE) for precision cancer diagnosis. Specifically, we first integrated protein-coding gene expression profiles, non-coding RNA expression profiles, and DNA methylation data to provide rich information. We then designed a novel hierarchical feature selection method, which takes CpG-gene biological associations into account and selects a compact set of superior features. Next, we used four individual classifiers with significant differences and clear complementarity to build the heterogeneous classifiers. Lastly, we developed a second learning probability error ensemble model, called SLPEE, that learns from new data consisting of the classifier-predicted class probability values and the actual labels, thereby self-correcting diagnosis errors. Benchmarking comparisons on TCGA showed that HFS-SLPEE performs better than the state-of-the-art approaches. Moreover, we analyzed 10 groups of selected features in depth and found several novel HFS-SLPEE-predicted epigenomics and epigenetics biomarkers for breast invasive carcinoma (BRCA) (e.g., TSLP and ADAMTS9-AS2), lung adenocarcinoma (LUAD) (e.g., HBA1 and CTB-43E15.1), and kidney renal clear cell carcinoma (KIRC) (e.g., IRX2 and BMPR1B-AS1).
Keywords: DNA methylation; biomarker; ensemble model; hierarchical feature selection; precision cancer diagnosis; transcriptome profiling
Year: 2021 PMID: 34277640 PMCID: PMC8278475 DOI: 10.3389/fcell.2021.696359
Source DB: PubMed Journal: Front Cell Dev Biol ISSN: 2296-634X
FIGURE 1. A flowchart of HFS-SLPEE. We first integrate the protein-coding gene expression profiles, non-coding RNA expression profiles, and DNA methylation data to obtain rich information. Afterward, considering the CpG-gene biological associations, we design a novel hierarchical feature selection method to obtain a compact group of superior features. Next, we train four heterogeneous classifiers on the training set with the selected features and optimize their parameters via grid search on the validation set. Finally, we develop a second learning probability error ensemble model (named SLPEE) to combine the class probability predictions of the heterogeneous classifiers under the optimal parameters. SLPEE is used to predict the testing set in each fold. HFS-SLPEE is thus a powerful framework for precision cancer diagnosis.
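The second-level learning step in the flowchart can be illustrated with a generic stacking sketch. This is not the authors' SLPEE implementation, whose details are not given here: the second-level learner is assumed to be a plain logistic regression fitted by gradient descent, and the function names, probabilities, and labels below are hypothetical. Each row holds the tumor probabilities that four base classifiers assign to one held-out sample.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_second_level(prob_rows, labels, lr=0.5, epochs=2000):
    """Fit logistic regression on base-classifier probabilities (batch gradient descent)."""
    k = len(prob_rows[0])
    n = len(prob_rows)
    w = [0.0] * k
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * k
        grad_b = 0.0
        for x, y in zip(prob_rows, labels):
            # Prediction error of the second-level model on this sample.
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(k):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_second_level(model, prob_row):
    """Return the ensemble's tumor probability for one sample."""
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, prob_row)) + b)


# Hypothetical held-out predictions: P(tumor) from four base classifiers.
val_probs = [
    [0.90, 0.80, 0.70, 0.95],  # tumor
    [0.80, 0.40, 0.90, 0.85],  # tumor, classifier 2 is wrong
    [0.70, 0.90, 0.60, 0.30],  # tumor, classifier 4 is wrong
    [0.20, 0.10, 0.30, 0.05],  # normal
    [0.10, 0.60, 0.20, 0.15],  # normal, classifier 2 is wrong
    [0.30, 0.20, 0.10, 0.40],  # normal
]
val_labels = [1, 1, 1, 0, 0, 0]  # 1 = tumor, 0 = normal

model = fit_second_level(val_probs, val_labels)
new_sample = [0.85, 0.30, 0.80, 0.90]  # classifier 2 disagrees with the others
p_tumor = predict_second_level(model, new_sample)
print(f"P(tumor) = {p_tumor:.3f}")
```

Because the second-level model is trained on predicted probabilities paired with true labels, it can learn to discount a base classifier that errs systematically, which is the general idea behind correcting individual classifiers' errors at the ensemble stage.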
FIGURE 2. Heatmap analysis of differentially methylated genes and volcano plot analysis of differentially expressed genes for the BRCA dataset. In panel (A), the rows represent the methylation levels of the genes, and the columns represent the normal and tumor samples. Dark red shades indicate higher levels of methylation, and dark blue shades indicate lower levels of methylation. Color keys indicate the intensity associated with normalized beta values. In panel (B), the x-axis represents the log2FC, the y-axis represents −log10(FDR), and each dot represents a gene. The significantly upregulated genes are highlighted in red, and the significantly downregulated genes in blue.
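The volcano plot coordinates in panel (B) can be computed as in the following minimal sketch. It assumes that the FDR is obtained by Benjamini-Hochberg adjustment of per-gene p-values, and that the cutoffs (|log2FC| ≥ 1, FDR < 0.05) and the pseudocount are illustrative choices; the source states none of these. Gene names and values are hypothetical.

```python
import math


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (FDR) in input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def volcano_points(genes, mean_tumor, mean_normal, pvalues,
                   fc_cut=1.0, fdr_cut=0.05, pseudocount=1.0):
    """Return (gene, log2FC, -log10(FDR), label) for each gene."""
    fdr = benjamini_hochberg(pvalues)
    points = []
    for gene, t, n, q in zip(genes, mean_tumor, mean_normal, fdr):
        log2fc = math.log2((t + pseudocount) / (n + pseudocount))
        y = -math.log10(max(q, 1e-300))  # guard against log10(0)
        if q < fdr_cut and log2fc >= fc_cut:
            label = "up"      # drawn in red
        elif q < fdr_cut and log2fc <= -fc_cut:
            label = "down"    # drawn in blue
        else:
            label = "ns"      # not significant
        points.append((gene, log2fc, y, label))
    return points


# Hypothetical mean expression values and p-values for three genes.
points = volcano_points(
    genes=["geneA", "geneB", "geneC"],
    mean_tumor=[31.0, 3.0, 10.0],
    mean_normal=[7.0, 15.0, 10.0],
    pvalues=[0.001, 0.004, 0.9],
)
for gene, x, y, label in points:
    print(f"{gene}: log2FC={x:.2f}, -log10(FDR)={y:.2f}, {label}")
```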
Summary of the original datasets for the three cancers.
| Cancer | Datasets | No. of tumor | No. of normal | Total samples | Dimensions |
| BRCA | Protein-coding gene | 1,109 | 113 | 1,222 | 19,676 |
| BRCA | ncRNA | 1,109 | 113 | 1,222 | 40,568 |
| BRCA | DNA methylation | 796 | 96 | 892 | 485,577 |
| LUAD | Protein-coding gene | 535 | 59 | 594 | 19,676 |
| LUAD | ncRNA | 535 | 59 | 594 | 40,568 |
| LUAD | DNA methylation | 475 | 32 | 507 | 485,577 |
| KIRC | Protein-coding gene | 539 | 72 | 611 | 19,676 |
| KIRC | ncRNA | 539 | 72 | 611 | 40,568 |
| KIRC | DNA methylation | 325 | 160 | 485 | 485,577 |
Summary of the preprocessed datasets for the three cancers.
| Cancer | Datasets | No. of tumor | No. of normal | Total samples | Dimensions |
| BRCA | Protein-coding gene | 1,098 | 113 | 1,211 | 19,676 |
| BRCA | ncRNA | 1,098 | 113 | 1,211 | 40,568 |
| BRCA | DNA methylation | 792 | 96 | 888 | 485,577 |
| LUAD | Protein-coding gene | 517 | 59 | 576 | 19,676 |
| LUAD | ncRNA | 517 | 59 | 576 | 40,568 |
| LUAD | DNA methylation | 464 | 32 | 496 | 485,577 |
| KIRC | Protein-coding gene | 531 | 72 | 603 | 19,676 |
| KIRC | ncRNA | 531 | 72 | 603 | 40,568 |
| KIRC | DNA methylation | 321 | 160 | 481 | 485,577 |
FIGURE 3. The relationship curves of the features and the accuracy for the three cancers.
The diagnosis results for the three cancers by HFS-SLPEE (%).
| Metrics | BRCA | LUAD | KIRC |
| No. of features | 21 | 12 | 16 |
| Accuracy | 99.65% | 100% | 100% |
| Sensitivity | 99.61% | 100% | 100% |
| Specificity | 100% | 100% | 100% |
| F1-score | 99.81% | 100% | 100% |
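For reference, the four metrics in the table follow the standard confusion-matrix definitions, with tumor as the positive class. The sketch below assumes those standard definitions; the function name and the counts are hypothetical and are not taken from the paper.

```python
def diagnosis_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, specificity and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # fraction of tumor samples detected
    specificity = tn / (tn + fp)   # fraction of normal samples detected
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts: 100 tumor samples (2 missed) and 20 normal samples.
m = diagnosis_metrics(tp=98, fp=0, tn=20, fn=2)
for name, value in m.items():
    print(f"{name}: {100 * value:.2f}%")
```

With no false positives, specificity is 100% even though accuracy and sensitivity fall below 100%, the same pattern as the BRCA column.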
FIGURE 4. The comparison results of different datasets. (A) The histogram of comparison results. (B) The annotated heatmap of comparison results. As shown, compared with the other seven datasets, the integrated data of the protein-coding gene expression profiles, non-coding RNA expression profiles, and DNA methylation can improve the performance of the model.
FIGURE 5. The comparison results of SLPEE and the other four models.
Comparison results with the state-of-the-art approaches (%).
| Cancer | Datasets | Methods | Accuracy | Sensitivity | Specificity | F1 |
| BRCA | mRNA | | 98.41 | – | – | – |
| BRCA | mRNA | Proposed | | | | |
| BRCA | DM | | 98.33 | – | – | 94.90 |
| BRCA | DM | Proposed | | | | |
| BRCA | mRNA + DM | | 97.33 | 96.82 | – | |
| BRCA | mRNA + DM | Proposed | 98.80 | | | |
| BRCA | Transcriptome + DM | Proposed | | | | |
| LUAD | DM | | 99.25 | – | – | 96.50 |
| LUAD | DM | Proposed | | | | |
| LUAD | Transcriptome + DM | Proposed | | | | |
| KIRC | DM | | 99.55 | – | – | 99.40 |
| KIRC | DM | Proposed | | | | |
| KIRC | Transcriptome + DM | Proposed | | | | |
The summary of selected features for the three cancers.
| Cancer | Protein-coding genes | ncRNAs | Methylated genes |
| BRCA | WT1-AS(5), AL513523.2(5), F13A1(4), ABCB10P4(4), PABPC5(2), OR14I1(1), NID2(1), AC005796.2(1), AC079922.3(1), CCDC181(1), IGHV3OR16-10(1) | ||
| LUAD | RP11-344B5.3(1) | ||
| KIRC | RP11-266E6.3(1), PIK3IP1-AS1(1) |