Abstract
To date, the support vector machine (SVM) has been widely applied in diverse biomedical fields to address disease subtype identification and the pathogenicity of genetic variants. In this paper, I propose the weighted K-means support vector machine (wKM-SVM) and the weighted support vector machine (wSVM), in which the SVM imposes weights on the loss term. I also demonstrate the numerical relationship between the objective function of the SVM and the weights. Motivated by general ensemble techniques, which are known to improve accuracy, I directly apply the boosting algorithm to the newly proposed weighted KM-SVM (and wSVM). In terms of predictive performance, a range of simulation studies demonstrates that the weighted KM-SVM (and wSVM) with boosting outperforms the standard KM-SVM (and SVM) as well as many popular classification rules. I applied the proposed methods to simulated data and to two large-scale real applications: the TCGA pan-cancer methylation data of breast and kidney cancer. In conclusion, the weighted KM-SVM (and wSVM) increases the accuracy of the classification model and will facilitate disease diagnosis and clinical treatment decisions to benefit patients. A software package (wSVM) is publicly available at the R-project webpage (https://www.r-project.org).
Keywords: K-means clustering; Support vector machine; TCGA; Weighted SVM
Year: 2016 PMID: 27512621 PMCID: PMC4960100 DOI: 10.1186/s40064-016-2677-4
Source DB: PubMed Journal: Springerplus ISSN: 2193-1801
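The abstract states that the wSVM imposes weights on the loss term but does not give the objective. As a hedged illustration only, one common way to write a weighted soft-margin SVM is shown below; the symbols $w_i$ (observation weights), $C$ (cost parameter), $\beta$ and $\beta_0$ (coefficients and intercept) are assumptions for this sketch, and the paper's exact formulation may differ.

```latex
\min_{\beta,\,\beta_0}\;
  \frac{1}{2}\lVert \beta \rVert^{2}
  + C \sum_{i=1}^{n} w_i \,\bigl[\, 1 - y_i\,(\beta_0 + x_i^{\top}\beta) \,\bigr]_{+},
\qquad y_i \in \{-1, +1\},\; w_i \ge 0 .
```

Setting every $w_i$ to the same constant recovers the standard SVM hinge-loss objective, so under this form the weights only change how much each observation's margin violation costs.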
| The weighted KM-SVM (or SVM) with the boosting algorithm |
|---|
| 1. Initialize the weight |
| 2. For |
| (1) Fit a KM-SVM (or SVM) |
| (2) Compute |
| (3) Compute |
| (4) Set |
| 3. Output |
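The formulas belonging to the steps above are not preserved in this text, so the following is only a minimal sketch that assumes the standard AdaBoost.M1 updates: a weighted error rate, `alpha = log((1 - err) / err)`, multiplicative up-weighting of misclassified observations, and a sign of the alpha-weighted vote as output. The function names (`fit_weighted_svm`, `boost`, `predict_boosted`, `toy_data`) and all parameters are hypothetical. The base learner is a plain linear SVM with a per-observation weighted hinge loss fitted by subgradient descent; it is not the author's wSVM package, and the K-means step of KM-SVM is omitted.

```python
import math
import random


def fit_weighted_svm(X, y, w, lam=0.01, epochs=200, lr=0.1):
    """Hypothetical base learner: linear SVM fitted by subgradient descent on
    lam/2 * ||beta||^2 + sum_i w_i * max(0, 1 - y_i * (x_i . beta + b0))."""
    d = len(X[0])
    beta, b0 = [0.0] * d, 0.0
    for _ in range(epochs):
        grad, grad0 = [lam * bj for bj in beta], 0.0
        for xi, yi, wi in zip(X, y, w):
            margin = yi * (sum(bj * xj for bj, xj in zip(beta, xi)) + b0)
            if margin < 1:  # only margin violators contribute to the hinge term
                for j in range(d):
                    grad[j] -= wi * yi * xi[j]
                grad0 -= wi * yi
        beta = [bj - lr * gj for bj, gj in zip(beta, grad)]
        b0 -= lr * grad0
    return beta, b0


def predict(model, x):
    beta, b0 = model
    return 1 if sum(bj * xj for bj, xj in zip(beta, x)) + b0 >= 0 else -1


def boost(X, y, n_rounds=10):
    """Assumed AdaBoost.M1-style loop around the weighted SVM."""
    n = len(X)
    w = [1.0 / n] * n                                  # step 1: uniform weights
    models, alphas = [], []
    for _ in range(n_rounds):                          # step 2
        model = fit_weighted_svm(X, y, w)              # (1) fit with current weights
        miss = [predict(model, xi) != yi for xi, yi in zip(X, y)]
        err = sum(wi for wi, m in zip(w, miss) if m) / sum(w)   # (2) weighted error
        err = min(max(err, 1e-10), 1 - 1e-10)          # guard against log(0)
        alpha = math.log((1 - err) / err)              # (3) classifier weight
        w = [wi * math.exp(alpha) if m else wi for wi, m in zip(w, miss)]  # (4)
        total = sum(w)
        w = [wi / total for wi in w]                   # renormalize to sum to 1
        models.append(model)
        alphas.append(alpha)
        if not any(miss):                              # perfect fit: nothing to reweight
            break
    return models, alphas


def predict_boosted(models, alphas, x):
    """Step 3: sign of the alpha-weighted vote."""
    vote = sum(a * predict(m, x) for m, a in zip(models, alphas))
    return 1 if vote >= 0 else -1


def toy_data(n_per_class=20, seed=0):
    """Two well-separated Gaussian clusters, for illustration only."""
    rng = random.Random(seed)
    X = [[rng.gauss(3, 0.5), rng.gauss(3, 0.5)] for _ in range(n_per_class)]
    X += [[rng.gauss(-3, 0.5), rng.gauss(-3, 0.5)] for _ in range(n_per_class)]
    y = [1] * n_per_class + [-1] * n_per_class
    return X, y
```

A call such as `models, alphas = boost(*toy_data())` followed by `predict_boosted(models, alphas, [2.5, 3.1])` exercises the whole loop. The error clipping and the early exit are practical guards added for this sketch, not steps taken from the source.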
Fig. 1 Performance comparisons across different classification rules. Each dot represents the average over repeated simulations, and the bars overlaid on the dots represent standard errors. a Prediction errors of six different classification rules, b decreasing test error rates as r (coordinates of centers) increases
Fig. 2 a Test error rates of the weighted SVM as the number of boosting iterations increases. b Test error rates of the weighted KM-SVM as the number of boosting iterations increases
Brief descriptions of the two TCGA methylation datasets with disease-related binary phenotypes (case and control). All datasets are publicly available
| Name | Study | Type | # of samples | Control | Case | # of matched genes | Reference |
|---|---|---|---|---|---|---|---|
| BRCA | Breast cancer | Methylation | 343 | 27 | 316 | 10,121 | The Cancer Genome Atlas (TCGA) |
| KIRC | Kidney cancer | Methylation | 418 | 199 | 219 | 10,121 | The Cancer Genome Atlas (TCGA) |
Fig. 3 a Performance comparisons of breast cancer data across different classification rules, b performance comparisons of kidney cancer data across different classification rules