Yunyun Dong1,2, Wenkai Yang1, Jiawen Wang1, Juanjuan Zhao1, Yan Qiang3, Zijuan Zhao1, Ntikurako Guy Fernand Kazihise1, Yanfen Cui4, Xiaotong Yang4, Siyuan Liu5.
Abstract
BACKGROUND: Lung cancer is one of the most common types of cancer, and lung adenocarcinoma accounts for the largest proportion of cases. Accurate staging is a prerequisite for effective diagnosis and treatment of lung adenocarcinoma. Previous research has mainly used single-modal data, such as gene expression data, for classification and prediction. Integrating multi-modal genetic data (gene expression RNA-seq, methylation data and copy number variation) from the same patient makes it possible to use multi-modal genetic data for cancer prediction. A new machine learning method called gcForest has recently been proposed and has been shown to be suitable for classification in some fields. However, the model may face challenges when applied to small samples and high-dimensional genetic data.
Keywords: MLW-gcForest; staging model; multi-modal genetic data; lung adenocarcinoma
Year: 2019 PMID: 31726986 PMCID: PMC6857238 DOI: 10.1186/s12859-019-3172-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Illustration of the cascade forest structure [28]
Fig. 2 Illustration of the MLW-gcForest (multi-weighted gcForest) model. The MLW-gcForest model is composed of multi-grained scanning and a cascade forest. We made two improvements in the multi-grained scanning module: we assign different weights to the random forests according to the classification performance of each random forest and name the weights α; and we assign corresponding weights to different sliding windows and name the weights β
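The idea of weighting each random forest by its classification performance can be illustrated with a small sketch. The caption does not give the formula for α, so the code below assumes weights proportional to a validation score and uses them to form a weighted sum of the forests' class vectors. The function names and all numbers are hypothetical, and this is not necessarily the rule used in MLW-gcForest.

```python
def performance_weights(scores):
    """Normalize non-negative performance scores so that they sum to 1.

    Assumption: weights are proportional to the score; the source
    does not state the actual formula for alpha.
    """
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return [s / total for s in scores]


def weighted_class_vector(class_vectors, weights):
    """Weighted sum of class-probability vectors, one vector per forest."""
    n_classes = len(class_vectors[0])
    return [
        sum(w * vec[c] for w, vec in zip(weights, class_vectors))
        for c in range(n_classes)
    ]


# Hypothetical validation accuracies of two forests (illustrative only).
alpha = performance_weights([0.60, 0.80])

# Hypothetical class vectors that the two forests produce for one window.
combined = weighted_class_vector([[0.5, 0.5], [0.9, 0.1]], alpha)
```

Under the same assumption, the window weights β could be derived in the same way from a per-window performance score.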
Fig. 3 The basic structure of the sorting optimization algorithm. Different weights are assigned to the class vectors generated by different sliding windows
Fig. 4 MLW-gcForest decision fusion of multi-modal data. Multi-modal (gene expression RNA-seq, methylation data and CNV) lung adenocarcinoma genetic data are used to train different MLW-gcForest models, and decision-level fusion is performed
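Decision-level fusion means that each modality gets its own model and only the model outputs are combined. The caption does not say which fusion rule is used, so the following sketch assumes plain soft voting (an unweighted mean of the per-modality class probabilities). The function name and the probabilities are hypothetical.

```python
def fuse_decisions(prob_by_modality):
    """Average per-modality class probabilities; return (label, fused vector).

    Assumption: unweighted soft voting. The fusion rule actually used
    by the authors is not stated in the text above.
    """
    vectors = list(prob_by_modality.values())
    n_classes = len(vectors[0])
    fused = [
        sum(vec[c] for vec in vectors) / len(vectors)
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=fused.__getitem__)
    return label, fused


# Hypothetical outputs of three single-modality models for one patient.
probs = {
    "methylation": [0.2, 0.8],
    "rna_seq": [0.6, 0.4],
    "cnv": [0.3, 0.7],
}
label, fused = fuse_decisions(probs)
```

In this toy case the RNA-seq model alone would pick class 0, but the fused decision follows the other two modalities.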
Fig. 5 Nested cross-validation process, which includes an outer loop and an inner loop. The inner cross-validation loop is used to perform parameter adjustments, while the outer cross-validation loop is used to calculate the final error estimate for model performance
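The two loops can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' pipeline: it substitutes a 1-D k-nearest-neighbour classifier for MLW-gcForest, tunes only the hypothetical parameter `k`, and runs on synthetic data. The point it shows is that tuning sees only the outer training portion, so the outer test fold stays untouched until the final scoring.

```python
import random


def k_folds(indices, k):
    """Split a list of indices into k roughly equal folds."""
    return [indices[i::k] for i in range(k)]


def knn_predict(train, x, k):
    """Majority label among the k nearest 1-D training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


def accuracy(train, test, k):
    """Fraction of test points that k-NN classifies correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)


def split(data, fold):
    """Return (train, test) lists for one held-out fold of indices."""
    held = set(fold)
    train = [data[i] for i in range(len(data)) if i not in held]
    test = [data[i] for i in fold]
    return train, test


def cross_val_accuracy(data, k, n_folds):
    """Mean accuracy over n_folds folds (used as the inner loop)."""
    folds = k_folds(list(range(len(data))), n_folds)
    scores = [accuracy(*split(data, fold), k) for fold in folds]
    return sum(scores) / len(scores)


def nested_cv(data, grid, outer_folds=5, inner_folds=3):
    """Outer loop estimates performance; inner loop selects k."""
    scores = []
    for fold in k_folds(list(range(len(data))), outer_folds):
        train, test = split(data, fold)
        # Inner loop: tune on the outer training portion only.
        best_k = max(
            grid, key=lambda k: cross_val_accuracy(train, k, inner_folds)
        )
        # Outer loop: score the tuned model on the untouched test fold.
        scores.append(accuracy(train, test, best_k))
    return sum(scores) / len(scores)


# Synthetic two-class data (illustrative only).
rng = random.Random(0)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(30)]
data += [(rng.gauss(3.0, 1.0), 1) for _ in range(30)]
rng.shuffle(data)

estimate = nested_cv(data, grid=[1, 3, 5])
```

Because `best_k` is chosen separately inside each outer fold, `estimate` reflects the whole tune-then-fit procedure rather than one fixed parameter setting.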
Fig. 6 The results from the lung adenocarcinoma staging model based on different single-modal datasets with different algorithms. Row (a) contains the results from the use of the methylation data with different algorithms: (a1) ROC curves; (a2) Accuracy; (a3) Recall. Row (b) contains the results from the use of the RNA-seq data with different algorithms: (b1) ROC curves; (b2) Accuracy; (b3) Recall. Row (c) contains the results from the use of the CNV data with different algorithms: (c1) ROC curves; (c2) Accuracy; (c3) Recall
Performance comparison of different modal data with different methods
| Classification algorithm | Methylation Precision | Methylation F1 | RNA Precision | RNA F1 | CNV Precision | CNV F1 |
|---|---|---|---|---|---|---|
| SVM | 0.524 | 0.519 | 0.552 | 0.558 | 0.427 | 0.434 |
| KNN | 0.584 | 0.605 | 0.533 | 0.528 | 0.460 | 0.466 |
| LR | 0.575 | 0.572 | 0.609 | 0.603 | 0.446 | 0.486 |
| RF | 0.606 | 0.618 | 0.611 | 0.602 | 0.512 | 0.557 |
| gcForest | 0.715 | 0.709 | 0.634 | 0.643 | 0.616 | 0.628 |
| MLW-gcForest | 0.771 | 0.767 | 0.659 | 0.669 | 0.675 | 0.677 |
Fig. 7 Performance comparison using different algorithms based on multi-modal data: (a) ROC curve, (b) Accuracy
The effects of different classification algorithms on the precision, recall, and F1 score of the staging model based on the multi-modal data
| Algorithm | Precision | Recall | F1 |
|---|---|---|---|
| SVM | 0.674 | 0.664 | 0.669 |
| KNN | 0.664 | 0.646 | 0.655 |
| LR | 0.675 | 0.669 | 0.672 |
| RF | 0.706 | 0.730 | 0.718 |
| gcForest | 0.764 | 0.795 | 0.779 |
| MLW-gcForest | 0.896 | 0.882 | 0.889 |
Fig. 8 Performance comparison between multi-modal, methylation, RNA and CNV
Performance of the lung adenocarcinoma staging model with different modalities of data
| Modality | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Methylation | 0.751 | 0.771 | 0.763 | 0.767 |
| RNA | 0.689 | 0.659 | 0.679 | 0.669 |
| CNV | 0.645 | 0.675 | 0.677 | 0.677 |
| Multi-modal | 0.908 | 0.896 | 0.882 | 0.889 |
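As a quick consistency check on the table above, F1 can be recomputed from the reported precision and recall, assuming the standard definition F1 = 2PR / (P + R). The values are copied from the table; the helper name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "Methylation": (0.771, 0.763, 0.767),
    "RNA": (0.659, 0.679, 0.669),
    "CNV": (0.675, 0.677, 0.677),
    "Multi-modal": (0.896, 0.882, 0.889),
}

# Absolute gap between the recomputed and the reported F1 per modality.
gaps = {
    name: abs(f1_score(p, r) - f1)
    for name, (p, r, f1) in reported.items()
}

for name, gap in gaps.items():
    print(f"{name}: |recomputed - reported| = {gap:.4f}")
```

The recomputed values match the reported ones to about 0.001, which is what rounding of the three-decimal inputs would allow.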
Fig. 9 Comparison of MLW-gcForest models constructed with different numbers of decision trees in the decision forests for (a) methylation data, (b) RNA-seq data, and (c) CNV data