Victor Tkachev, Maxim Sorokin, Artem Mescheryakov, Alexander Simonov, Andrew Garazha, Anton Buzdin, Ilya Muchnik, Nicolas Borisov.
Abstract
Here, we propose a heuristic data-trimming technique for SVM, termed FLOating Window Projective Separator (FloWPS), tailored for personalized predictions based on molecular data. The procedure can operate on high-throughput genetic datasets such as gene expression or mutation profiles. Its application prevents SVM from extrapolating by excluding non-informative features. FloWPS requires training on data from individuals with known clinical outcomes to create a clinically relevant classifier. As usual, the genetic profiles linked with the outcomes are split into training and validation datasets. The unique property of FloWPS is that features of a validation data point that lack a significant number of neighboring hits in the training dataset are removed from further analysis. Next, similarly to the k-nearest-neighbors (kNN) method, for each point of the validation dataset FloWPS takes into account only the proximal points of the training dataset. Thus, for every validation point, the training dataset is adjusted to form a floating window. FloWPS performance was tested on ten gene expression datasets covering 992 cancer patients who either responded or did not respond to different types of chemotherapy. We confirmed experimentally by leave-one-out cross-validation that FloWPS significantly increases the quality of a classifier built on classical SVM in most applications, particularly for polynomial kernels.
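The evaluation protocol named in the abstract (leave-one-out cross-validation of an SVM classifier, scored by ROC AUC) can be sketched as follows. The data here are synthetic stand-ins for real expression profiles, and scikit-learn's `SVC` plays the role of the classical SVM baseline; nothing below is the authors' implementation.

```python
# Minimal sketch of the evaluation protocol from the abstract:
# leave-one-out cross-validation of an SVM responder classifier,
# scored by ROC AUC. The data below are synthetic stand-ins for
# real expression profiles (1 = responder, 0 = non-responder).
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_samples, n_features = 60, 15              # patients x marker genes
X = rng.normal(size=(n_samples, n_features))
y = (X[:, 0] + 0.5 * rng.normal(size=n_samples) > 0).astype(int)

scores = np.empty(n_samples)
for train_idx, test_idx in LeaveOneOut().split(X):
    # refit on all patients but one, score the held-out patient
    clf = SVC(kernel="poly", degree=2).fit(X[train_idx], y[train_idx])
    scores[test_idx] = clf.decision_function(X[test_idx])

print(f"LOOCV AUC: {roc_auc_score(y, scores):.2f}")
```

A polynomial kernel is used here because the abstract singles out polynomial kernels as the case where FloWPS helps most.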
Keywords: bioinformatics; gene expression; machine learning; oncology; personalized medicine; support vector machines
Year: 2019 PMID: 30697229 PMCID: PMC6341065 DOI: 10.3389/fgene.2018.00717
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Clinically annotated gene expression datasets.
| Dataset ID | Disease type | Treatment type | Experimental platform | Number of samples | Number of core marker genes |
|---|---|---|---|---|---|
| GSE25066 | Breast cancer with different hormonal and HER2 status | Neoadjuvant taxane + anthracycline | Affymetrix Human Genome U133 Array | 235 (118 responders, 117 non-responders) | 20 |
| GSE41998 | Breast cancer with different hormonal and HER2 status | Neoadjuvant doxorubicin + cyclophosphamide, followed by paclitaxel | Affymetrix Human Genome U133 Array | 68 (34 responders, 34 non-responders) | 11 |
| GSE9782 | Multiple myeloma | Bortezomib | Affymetrix Human Genome U133 Array | 169 (85 responders, 84 non-responders) | 18 |
| GSE39754 | Multiple myeloma | Vincristine + adriamycin + dexamethasone followed by ASCT | Affymetrix Human Exon 1.0 ST Array | 124 (62 responders, 62 non-responders) | 16 |
| GSE68871 | Multiple myeloma | Bortezomib-thalidomide-dexamethasone (VTD) | Affymetrix Human Genome U133 Plus | 98 (49 responders, 49 non-responders) | 12 |
| GSE55145 | Multiple myeloma | Bortezomib followed by ASCT | Affymetrix Human Exon 1.0 ST Array | 56 (28 responders, 28 non-responders) | 14 |
| TARGET-50 | Childhood kidney Wilms tumor | Vincristine sulfate + non-target drugs + conventional surgery + radiation therapy | Illumina HiSeq 2000 | 72 (36 responders, 36 non-responders) | 14 |
| TARGET-10 | Childhood B acute lymphoblastic leukemia | Vincristine sulfate + non-target drugs | Illumina HiSeq 2000 | 60 (30 responders, 30 non-responders) | 14 |
| TARGET-20 | Childhood acute myeloid leukemia | Non-target drugs including busulfan and cyclophosphamide | Illumina HiSeq 2000 | 46 (23 responders, 23 non-responders) | 10 |
| TARGET-20 | Childhood acute myeloid leukemia | Non-target drugs excluding busulfan and cyclophosphamide | Illumina HiSeq 2000 | 124 (62 responders, 62 non-responders) | 16 |
FIGURE 1 Data trimming pipeline. (A) Selection of relevant features in FloWPS according to the m-condition. A violet dot shows the position of a validation point; turquoise dots stand for points from the training dataset. A feature (here: f1 or f2) is considered relevant when at least m flanking training points are present on both sides of the validation point along the feature-specific axis. In the figure, the m-condition is satisfied for feature f1 only when m = 0, and for f2 when m ≤ 5. (B) After selection of the relevant features, only the k nearest neighbors in the training set are used to construct the SVM model. The figure shows k = 4, although values of k starting from 20 were used in our calculations.
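The two trimming steps in the caption can be sketched in code. Everything here is illustrative rather than the authors' implementation: `trim_and_fit`, the synthetic data, and the parameter values m = 2, k = 40 are our own choices, and the m-condition is implemented under the reading that at least m training values must lie on each side of the validation point along a retained feature axis.

```python
# Illustrative sketch (not the authors' code) of the two trimming
# steps: the m-condition feature filter and the k-nearest-neighbor
# floating window, followed by an SVM fit on the trimmed data.
import numpy as np
from sklearn.svm import SVC

def trim_and_fit(X_train, y_train, x_val, m=2, k=40):
    # m-condition: keep a feature only if at least m training points
    # lie on each side of the validation point along that axis.
    left = (X_train < x_val).sum(axis=0)
    right = (X_train > x_val).sum(axis=0)
    keep = (left >= m) & (right >= m)
    Xt, xv = X_train[:, keep], x_val[keep]
    # Floating window: restrict the training set to the k points
    # nearest to the validation point in the trimmed feature space.
    window = np.argsort(np.linalg.norm(Xt - xv, axis=1))[:k]
    clf = SVC(kernel="poly", degree=2).fit(Xt[window], y_train[window])
    return clf.predict(xv.reshape(1, -1))[0]

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 10))               # synthetic profiles
y = (X[:, 0] > 0).astype(int)               # synthetic response labels
pred = trim_and_fit(X[1:], y[1:], X[0], m=2, k=40)
```

Because both the feature set and the training window depend on the validation point, the model is refit from scratch for every patient, which is what makes the window "floating".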
FIGURE 2 Optimization of the data trimming parameters m and k for a given individual. (A) Overall scheme of prediction for an individual sample i = 1, …, N. All other individuals serve as the training dataset. For the training dataset at the fitting step, the AUC of the classifier's predictions is calculated and plotted (B) as a function of the data trimming parameters m and k. Positions of this AUC topogram where AUC > p ⋅ max(AUC), with p = 0.95, are considered prediction-accountable (highlighted in bright yellow) and form the prediction-accountable set S. This AUC topogram, as well as the set S, is individual for every validation point i.
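The selection of the prediction-accountable set S described in the caption can be sketched as follows. The grids and the AUC topogram below are made-up numbers for illustration only; in the real pipeline each AUC entry would come from refitting the trimmed classifier at that (m, k) on the training data.

```python
# Sketch of the per-patient tuning step: each (m, k) pair is scored
# by AUC at the fitting step, and pairs with AUC > p * max(AUC) form
# the prediction-accountable set S. The grids and the AUC topogram
# below are made-up values for illustration.
import numpy as np

p = 0.95
m_grid = [0, 1, 2, 3]
k_grid = [20, 30, 40]

# In the real pipeline auc[i, j] would come from refitting the trimmed
# classifier for each (m_grid[i], k_grid[j]) on the training data.
auc = np.array([[0.70, 0.74, 0.72],
                [0.73, 0.78, 0.77],
                [0.75, 0.80, 0.72],
                [0.71, 0.77, 0.74]])

S = np.argwhere(auc > p * auc.max())        # prediction-accountable set
pairs = [(m_grid[i], k_grid[j]) for i, j in S]
print(pairs)  # → [(1, 30), (1, 40), (2, 30), (3, 30)]
```

A per-patient prediction would then presumably aggregate the classifier outputs over the (m, k) pairs in S rather than committing to a single best pair.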
Performance (false discovery rate, FDR) of clinical response classifiers for clinically annotated gene expression datasets.
| Dataset | Top 30 marker genes | | | Core marker genes | | |
|---|---|---|---|---|---|---|
| | SVM, FDR | FloWPS, FDR | FloWPS, FDR | SVM, FDR | FloWPS, FDR | FloWPS, FDR |
| GSE25066 (n = 235) | 0.28 | 0.10 | 0.13 | 0.26 | 0.25 | 0.23 |
| GSE41998 (n = 68) | 0.25 | 0.14 | 0.14 | 0.14 | 0.15 | 0.12 |
| GSE9782 (n = 169) | 0.28 | 0.22 | 0.17 | 0.33 | 0.33 | 0.34 |
| GSE39754 (n = 124) | 0.36 | 0.27 | 0.34 | 0.36 | 0.36 | 0.35 |
| GSE68871 (n = 98) | 0.35 | 0.25 | 0.27 | 0.33 | 0.20 | 0.24 |
| GSE55145 (n = 56) | 0.19 | 0.11 | 0.11 | 0.24 | 0.19 | 0.06 |
| TARGET-50 (n = 72) | 0.35 | 0.13 | 0.16 | 0.26 | 0.08 | 0.09 |
| TARGET-10 (n = 60) | 0.16 | 0.14 | 0.12 | 0.13 | 0.07 | 0.04 |
| TARGET-20 (n = 46) | 0.26 | 0.16 | 0.17 | 0.23 | 0.22 | 0.00 |
| TARGET-20 (n = 124) | 0.28 | 0.30 | 0.27 | 0.26 | 0.13 | 0.11 |
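The FDR values reported above are, on the usual reading of the abbreviation for a binary classifier, the fraction of predicted responders who did not actually respond, FDR = FP / (FP + TP). Assuming that definition, a minimal worked check with made-up toy calls:

```python
# FDR for a responder classifier, read as the fraction of predicted
# responders who did not actually respond: FDR = FP / (FP + TP).
# The calls and ground truth below are made-up toy values.
import numpy as np

pred  = np.array([1, 1, 1, 0, 0, 1])    # classifier calls, 1 = responder
truth = np.array([1, 0, 1, 0, 1, 1])    # actual clinical outcome
fp = int(((pred == 1) & (truth == 0)).sum())   # false positives
tp = int(((pred == 1) & (truth == 1)).sum())   # true positives
print(fp / (fp + tp))  # → 0.25
```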
FIGURE 3 Distribution (violin plots, with each instance shown as a red/green dot) of FloWPS predictions (P) for patients without (red plots and dots) and with (green plots and dots) a positive clinical response to chemotherapy. For FloWPS, core marker genes and the p = 0.90 setting were used. The black horizontal line shows the discrimination threshold (τ) between responders and non-responders for each classifier. Panels represent different data sources: (A) GSE25066; (B) GSE41998; (C) GSE9782; (D) GSE39754; (E) GSE68871; (F) GSE55145; (G) TARGET-50; (H) TARGET-10; (I,J) TARGET-20 with and without busulfan and cyclophosphamide, respectively.
FIGURE 4 Receiver operating characteristic (ROC) curves showing the dependence of sensitivity (Sn) on specificity (Sp) for the FloWPS-based classifier of treatment response on the core marker gene datasets. Red dots: confidence parameter p = 0.95; blue dots: p = 0.90. Panels represent different clinically annotated datasets: (A) GSE25066; (B) GSE41998; (C) GSE9782; (D) GSE39754; (E) GSE68871; (F) GSE55145; (G) TARGET-50; (H) TARGET-10; (I,J) TARGET-20 with and without busulfan and cyclophosphamide, respectively.
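The (Sn, Sp) pairs plotted in such a ROC curve come from sweeping the discrimination threshold τ over the classifier scores. A sketch with synthetic stand-ins for the FloWPS outputs P:

```python
# Sensitivity/specificity pairs of the kind plotted in a ROC curve,
# computed by sweeping the discrimination threshold tau over the
# classifier scores. Scores and labels are synthetic stand-ins.
import numpy as np

scores = np.array([0.10, 0.40, 0.35, 0.80, 0.70, 0.20])  # predictions P
labels = np.array([0,    0,    1,    1,    1,    0])     # 1 = responder

for tau in np.unique(scores):
    called = scores >= tau                   # predicted responders at tau
    sn = (called & (labels == 1)).sum() / (labels == 1).sum()
    sp = (~called & (labels == 0)).sum() / (labels == 0).sum()
    print(f"tau={tau:.2f}  Sn={sn:.2f}  Sp={sp:.2f}")
```

Lowering τ trades specificity for sensitivity, which is why each confidence setting traces out a curve rather than a single point.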
FIGURE 5 AUC and FDR of the (non-)responder classifier as a function of the cost/penalty parameter C for classical SVM (without data trimming) and FloWPS, for both linear and polynomial kernels. Calculations were done for the core marker gene datasets with confidence parameter p = 0.90. Different panels represent different datasets: (A) GSE25066; (B) GSE41998; (C) GSE9782; (D) GSE39754; (E) GSE68871; (F) GSE55145; (G) TARGET-50; (H) TARGET-10; (I,J) TARGET-20 with and without busulfan and cyclophosphamide, respectively. (K) Legend showing the FloWPS and SVM modifications.
AUC of the (non-)responder classifier for classical SVM without data reduction (SVM), PCA-assisted SVM (PCA), and FloWPS with confidence parameter p = 0.90.
| Dataset | Linear kernel | | | | | | Polynomial kernel | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | SVM | PCA | FloWPS | SVM | PCA | FloWPS | SVM | PCA | FloWPS | SVM | PCA | FloWPS |
| GSE25066 (n = 235) | 0.73 | 0.77 | 0.76 | 0.63 | 0.77 | 0.75 | 0.65 | 0.67 | 0.74 | 0.63 | 0.66 | 0.75 |
| GSE41998 (n = 68) | 0.87 | 0.84 | 0.92 | 0.82 | 0.88 | 0.86 | 0.60 | 0.62 | 0.69 | 0.75 | 0.74 | 0.81 |
| GSE9782 (n = 169) | 0.68 | 0.72 | 0.72 | 0.60 | 0.72 | 0.72 | 0.62 | 0.68 | 0.73 | 0.64 | 0.68 | 0.76 |
| GSE39754 (n = 124) | 0.69 | 0.68 | 0.72 | 0.56 | 0.68 | 0.71 | 0.66 | 0.61 | 0.67 | 0.65 | 0.61 | 0.68 |
| GSE68871 (n = 98) | 0.68 | 0.68 | 0.77 | 0.69 | 0.68 | 0.76 | 0.64 | 0.65 | 0.72 | 0.69 | 0.76 | 0.74 |
| GSE55145 (n = 56) | 0.77 | 0.84 | 0.82 | 0.77 | 0.84 | 0.85 | 0.63 | 0.73 | 0.77 | 0.80 | 0.73 | 0.83 |
| TARGET-50 (n = 72) | 0.72 | 0.75 | 0.82 | 0.68 | 0.76 | 0.81 | 0.68 | 0.64 | 0.73 | 0.65 | 0.72 | 0.74 |
| TARGET-10 (n = 60) | 0.87 | 0.85 | 0.94 | 0.82 | 0.83 | 0.94 | 0.68 | 0.65 | 0.85 | 0.78 | 0.83 | 0.86 |
| TARGET-20 (n = 46) | 0.76 | 0.78 | 0.83 | 0.70 | 0.80 | 0.82 | 0.63 | 0.63 | 0.77 | 0.83 | 0.72 | 0.82 |
| TARGET-20 (n = 124) | 0.74 | 0.81 | 0.79 | 0.65 | 0.79 | 0.79 | 0.69 | 0.68 | 0.77 | 0.72 | 0.69 | 0.79 |
FIGURE 6 (A) Global machine-learning methods, such as SVM, may fail to separate classes in datasets without global order. (B) Machine learning with data trimming works locally and may separate classes more accurately.