Chihyun Park, Jaegyoon Ahn, Hyunjin Kim, Sanghyun Park.
Abstract
BACKGROUND: The prognosis of cancer recurrence is an important research area in bioinformatics and is challenging because sample sizes are small relative to the vast number of genes. There have been several attempts to predict cancer recurrence. Most studies employed a supervised approach, which uses only a few labeled samples. Semi-supervised learning can be a good alternative for this problem. There have been few attempts based on manifold assumptions to reveal the detailed roles of identified cancer genes in recurrence.
Year: 2014 PMID: 24497942 PMCID: PMC3908883 DOI: 10.1371/journal.pone.0086309
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Datasets used throughout the manuscript.
| Cancer type | GEO accession number | No. of labeled samples¹ | No. of unlabeled samples | No. of genes after filtering |
| Breast | GSE2990 | 125 (76: −1, 49: +1) | 64 | 13,046 |
| Colorectal | GSE17536 | 145 (109: −1, 36: +1) | 32 | 13,046 |
| Colon | GSE17538 | 181 (132: −1, 49: +1) | 32 | 13,046 |
| Breast | GSE4922 | 249 (160: −1, 89: +1) | 0 | 13,046 |
| Colorectal | GSE18105 | 111 (67: −1, 44: +1) | 0 | 13,046 |
| Protein-Protein Interaction | Human PPI | 108,544 (mapped to a gene symbol) | I2D database |
¹ −1: non-recurrence, +1: recurrence.
Figure 1. Detailed workflow to determine the optimal parameter set.
First, we construct a graph for regularization with only labeled samples by varying two parameters. In this phase, we use k-fold cross validation to determine the optimal parameter set. We then apply semi-supervised learning with the obtained optimal parameter set and predict the labels of the unknown samples. The proposed method uses unlabeled sample information to build a classifier by iterating the procedure.
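The parameter search described in this caption can be sketched as a grid search scored by k-fold cross validation. This is only an illustration under stated assumptions: `grid_search`, `k_fold_indices`, and `toy_fit_predict` are hypothetical names, and the toy classifier is a stand-in for the graph-based method, which the caption does not specify in detail.

```python
import random
from itertools import product
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def grid_search(samples, labels, grid_a, grid_b, fit_predict, k=10):
    """Return (best_accuracy, best_a, best_b) over all parameter pairs.

    fit_predict(train_x, train_y, test_x, a, b) is a hypothetical callable
    that trains on the labeled training fold and predicts the test fold.
    """
    folds = k_fold_indices(len(samples), k)
    best = (-1.0, None, None)
    for a, b in product(grid_a, grid_b):
        fold_acc = []
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in range(len(samples)) if i not in held_out]
            predicted = fit_predict(
                [samples[i] for i in train_idx],
                [labels[i] for i in train_idx],
                [samples[i] for i in test_idx],
                a,
                b,
            )
            hits = sum(p == labels[i] for p, i in zip(predicted, test_idx))
            fold_acc.append(hits / len(test_idx))
        score = mean(fold_acc)  # average accuracy across the k folds
        if score > best[0]:
            best = (score, a, b)
    return best


def toy_fit_predict(train_x, train_y, test_x, a, b):
    # Stand-in classifier (assumption): thresholds each value at a, ignores b.
    return [1 if x > a else -1 for x in test_x]


xs = [i / 20 for i in range(20)]
ys = [1 if x > 0.5 else -1 for x in xs]
best = grid_search(xs, ys, [0.25, 0.5, 0.75], [0.0], toy_fit_predict, k=5)
print(best)  # (1.0, 0.5, 0.0)
```

The selected pair would then be reused when the unlabeled samples are added in the semi-supervised phase.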
Figure 2. Detailed workflow of the proposed semi-supervised learning algorithm.
We apply a graph regularization approach for semi-supervised learning, and the purpose of the proposed method is to predict the labels of unlabeled samples.
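As a rough illustration of graph-regularized semi-supervised learning, the sketch below uses the standard closed-form label-spreading solution on a sample similarity graph. It is not necessarily the authors' formulation: the function name `propagate_labels`, the weight `alpha`, and the toy graph are assumptions made for the example.

```python
import numpy as np


def propagate_labels(W, y, alpha=0.9):
    """Spread known labels over a similarity graph.

    W: symmetric, non-negative similarity matrix with zero diagonal.
    y: +1 / -1 for labeled samples, 0 for unlabeled samples.
    Returns (predicted labels in {-1, +1}, real-valued scores).
    """
    degree = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.where(degree > 0, degree, 1.0))
    # Symmetrically normalized similarity: S = D^{-1/2} W D^{-1/2}
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    n = len(y)
    # Closed-form solution of the regularized problem: F = (I - alpha S)^{-1} y
    F = np.linalg.solve(np.eye(n) - alpha * S, y.astype(float))
    return np.where(F >= 0, 1, -1), F


# Toy graph (assumption): two tight clusters joined by one weak edge.
W = np.zeros((6, 6))
for group in ([0, 1, 2], [3, 4, 5]):
    for i in group:
        for j in group:
            if i != j:
                W[i, j] = 1.0
W[2, 3] = W[3, 2] = 0.1

y = np.array([1, 0, 0, 0, 0, -1])  # one labeled sample per cluster
labels, scores = propagate_labels(W, y)
print(labels.tolist())  # [1, 1, 1, -1, -1, -1]
```

Each unlabeled sample takes the label of the cluster it is most strongly connected to, which is the manifold assumption in its simplest form.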
Figure 3. Experimental results of parameter testing.
We performed 100 different experiments while changing two threshold values and obtained 100 average accuracies for each dataset using 10-fold cross validation. We found the maximum, minimum, and average accuracies for each dataset in two cases. (1) We carried out 10-fold cross validation 100 times, varying the two thresholds on the original samples shown in Table 1. (2) We also carried out 10-fold cross validation 100 times, varying the two thresholds after balancing the number of samples in the two classes. To balance the classes, we randomly removed 27, 73, and 83 samples from the non-recurrence groups of GSE2990, GSE17536, and GSE17538, respectively.
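The class balancing in case (2) amounts to random undersampling of the majority class. A minimal sketch follows, using the GSE2990 class counts from Table 1; the function name `undersample_majority` and the fixed seed are assumptions for illustration.

```python
import random


def undersample_majority(labels, seed=0):
    """Return the indices kept after randomly dropping majority-class
    samples until both classes have the same size."""
    neg = [i for i, y in enumerate(labels) if y == -1]
    pos = [i for i, y in enumerate(labels) if y == 1]
    rng = random.Random(seed)
    minority_size = min(len(neg), len(pos))
    if len(neg) > minority_size:
        neg = rng.sample(neg, minority_size)
    if len(pos) > minority_size:
        pos = rng.sample(pos, minority_size)
    return sorted(neg + pos)


labels = [-1] * 76 + [1] * 49  # GSE2990: 76 non-recurrence, 49 recurrence
kept = undersample_majority(labels)
removed = len(labels) - len(kept)
print(len(kept), removed)  # 98 27
```

The same procedure removes 73 samples from GSE17536 (109 vs. 36) and 83 from GSE17538 (132 vs. 49).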
Optimal combination of two thresholds for each dataset in 10-fold cross validation.
| Cross validation | Group | Dataset (# of samples for each class) | Optimal (threshold 1) | Optimal (threshold 2) | Best accuracy | Sen. | Spec. |
| K = 10 | Original | GSE2990 (76: −1, 49: +1, 64: U) | 0.20 | 0.72 | 0.725 | 0.617 | 0.795 |
| | | GSE17536 (109: −1, 36: +1, 32: U) | 0.15 | 0.86 | 0.807 | 0.485 | 0.906 |
| | | GSE17538 (132: −1, 49: +1, 32: U) | 0.20 | 0.72 | 0.756 | 0.163 | 0.977 |
| | Adjusted | GSE2990 (49: −1, 49: +1, 64: U) | 0.45 | 0.76 | 0.767 | 0.721 | 0.809 |
| | | GSE17536 (36: −1, 36: +1, 32: U) | 0.15 | 0.84 | 0.786 | 0.882 | 0.694 |
| | | GSE17538 (49: −1, 49: +1, 32: U) | 0.35 | 0.90 | 0.767 | 0.756 | 0.778 |
Sen. = Sensitivity, Spec. = Specificity.
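As a consistency check on the table, overall accuracy can be recovered from sensitivity, specificity, and the class sizes. The sketch below applies this to the original GSE2990 row; the helper name `accuracy_from_rates` is hypothetical, and because the reported values are averages over cross-validation folds, other rows may agree only approximately.

```python
def accuracy_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Overall accuracy implied by per-class rates and class sizes."""
    true_pos = sensitivity * n_pos  # correctly predicted recurrence (+1)
    true_neg = specificity * n_neg  # correctly predicted non-recurrence (-1)
    return (true_pos + true_neg) / (n_pos + n_neg)


# GSE2990, original samples: 49 recurrence, 76 non-recurrence.
acc = accuracy_from_rates(0.617, 0.795, n_pos=49, n_neg=76)
print(round(acc, 3))  # 0.725
```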
Prediction performance comparison of the proposed method with four existing methods using PPI data to identify informative genes.
| Cancer type (GSE No.) | Data description | Proposed method | TSVM | SVM | Naïve Bayesian | Random Forest |
| Original samples | | | | | | |
| Breast (GSE2990) | L: 125 (−1: 76, +1: 49), U: 64 | 0.725 (0.617/0.795) | 0.543 (−/−) | 0.528 (0.671/0.306) | 0.592 (0.605/0.571) | 0.664 (0.921/0.265) |
| Colorectal (GSE17536) | L: 145 (−1: 109, +1: 36), U: 32 | 0.807 (0.485/0.906) | 0.752 (−/−) | 0.772 (0.889/0.389) | 0.759 (0.844/0.500) | 0.752 (0.963/0.111) |
| Colon (GSE17538) | L: 181 (−1: 132, +1: 49), U: 32 | 0.756 (0.163/0.977) | 0.728 (−/−) | 0.796 (0.917/0.469) | 0.707 (0.826/0.388) | 0.713 (0.955/0.061) |
| Adjusted (class-balanced) samples | | | | | | |
| Breast (GSE2990) | L: 98 (−1: 49, +1: 49), U: 64 | 0.767 (0.721/0.809) | 0.499 (−/−) | 0.510 (0.495/0.525) | 0.576 (0.574/0.565) | 0.522 (0.418/0.627) |
| Colorectal (GSE17536) | L: 72 (−1: 36, +1: 36), U: 32 | 0.786 (0.882/0.694) | 0.499 (−/−) | 0.630 (0.672/0.587) | 0.640 (0.628/0.652) | 0.597 (0.550/0.644) |
| Colon (GSE17538) | L: 98 (−1: 49, +1: 49), U: 32 | 0.767 (0.756/0.778) | 0.498 (−/−) | 0.635 (0.657/0.614) | 0.592 (0.465/0.718) | 0.572 (0.486/0.663) |
For each experiment, the optimal combination of two thresholds was obtained using the approach mentioned above and was applied to an independent test using unlabeled samples. Bold font indicates the superior performer.
TSVM: P (the ratio of two class labels).
SVM: PolyKernel -C 250007 -E 1.0; complexity parameter C (1.0), epsilon (1.0E−12), filterType (normalized training data).
Naïve Bayesian: No parameters.
Random Forest: numTrees (10), seed (1).
Figure 4. Experimental results of AUC comparison of the proposed method with three existing methods.
We compared AUC values of the proposed method and other supervised learning algorithms.
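For readers unfamiliar with the metric, AUC can be computed directly as the fraction of (recurrence, non-recurrence) pairs in which the recurrence sample receives the higher score, with ties counted as one half. The sketch below uses made-up scores purely for illustration; the function name `auc` and the sample values are assumptions, not data from the study.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of class scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    # Each pair contributes 1 if the positive outranks the negative, 0.5 on a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative classifier scores (not from the paper).
example_scores = [0.9, 0.8, 0.4, 0.3]
example_labels = [1, -1, 1, -1]
print(auc(example_scores, example_labels))  # 0.75
```

Unlike accuracy, this value does not depend on a single decision threshold, which makes it useful when classes are imbalanced.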
Figure 5. Representation of a breast cancer recurrence-specific gene sub-network related to cancer proliferation.
The orange-colored nodes are oncogenes.