Kevin Y X Wang, Gulietta M Pupo, Varsha Tembe, Ellis Patrick, Dario Strbenac, Sarah-Jane Schramm, John F Thompson, Richard A Scolyer, Samuel Muller, Garth Tarr, Graham J Mann, Jean Y H Yang.
Abstract
In this modern era of precision medicine, molecular signatures identified from advanced omics technologies hold great promise to better guide clinical decisions. However, current approaches are often location-specific due to the inherent differences between platforms and across multiple centres, which limits the transferability of molecular signatures. We present Cross-Platform Omics Prediction (CPOP), a penalised regression model that can use omics data to predict patient outcomes in a platform-independent manner and across time and experiments. CPOP improves on the traditional prediction framework of using gene-based features by selecting ratio-based features with similar estimated effect sizes. These components give CPOP stable performance across datasets of similar biology, minimising the effect of the technical noise often generated by omics platforms. We present a comprehensive evaluation using melanoma transcriptomics data to demonstrate its potential to be used as a critical part of a clinical screening framework for precision medicine. Generalisation was further assessed with ovarian cancer and inflammatory bowel disease studies.
Year: 2022 PMID: 35788693 PMCID: PMC9253123 DOI: 10.1038/s41746-022-00618-5
Source DB: PubMed Journal: NPJ Digit Med ISSN: 2398-6352
Fig. 1 Overview of CPOP and the motivating melanoma dataset.
a Schematic illustration of the five-step CPOP procedure with emphasis on the stable selection of features in Steps 3 and 4. b Quartile plot of the expression values of all genes (top panel) and of all pairwise log-ratio features (bottom panel) for each sample (n = 488) in the melanoma data collection. Each sample is represented by the median (a single solid point), the first quartile (the lower end of a vertical line) and the third quartile (the upper end of a vertical line) of all the gene/feature values for that sample. c NanoString probe selection (186 probes) based on results from our previous microarray studies[7,14]. d Scatter plot of log fold-change for genes common to MIA-Microarray and MIA-NanoString. e Boxplot comparisons of classification accuracy for overall survival (OS) and recurrence-free survival (RFS) between the MIA-Microarray and MIA-NanoString data. The y-axis shows the classification accuracy calculated from 100 repeated 5-fold cross-validation using a support vector machine classifier. The good/poor prognosis classes for OS and RFS are defined in Methods. The centreline of a boxplot denotes the median classification accuracy, and the lower and upper bounds of the box denote the first and third quartile values, respectively. The lower and upper whiskers extend 1.5 times the interquartile range below the first quartile and above the third quartile, respectively.
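The pairwise log-ratio features in panel b can be illustrated with a minimal Python sketch. It is not the CPOP implementation. The gene names, expression values and the 3x platform factor are hypothetical, and the helper `pairwise_log_ratios` is introduced here only for illustration. Under the simplifying assumption that a platform rescales every gene in a sample by the same positive factor, the factor cancels in each log-ratio, which is one intuition for why ratio-based features may be less platform-sensitive than gene-level values.

```python
import math
from itertools import combinations


def pairwise_log_ratios(expr):
    """Map {gene: positive expression} to {"A--B": log(A) - log(B)} for every gene pair."""
    genes = sorted(expr)  # fixed ordering so feature names are reproducible
    return {
        f"{a}--{b}": math.log(expr[a]) - math.log(expr[b])
        for a, b in combinations(genes, 2)
    }


# Hypothetical sample; platform B is assumed to report every gene 3x higher.
sample_a = {"GENE1": 120.0, "GENE2": 40.0, "GENE3": 10.0}
sample_b = {gene: 3.0 * value for gene, value in sample_a.items()}

ratios_a = pairwise_log_ratios(sample_a)
ratios_b = pairwise_log_ratios(sample_b)

# Gene-level values differ between platforms, but the log-ratios agree.
for name in ratios_a:
    print(name, round(ratios_a[name], 4), round(ratios_b[name], 4))
```

Real platform differences are not a single per-sample scale factor, so this only shows the idealised case in which the cancellation is exact.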
Data summaries for five melanoma datasets, MIA-Microarray, MIA-NanoString, TCGA, Sweden and MIA-Validation.
| | MIA-Microarray and MIA-NanoString | TCGA | Sweden | MIA-Validation |
|---|---|---|---|---|
| Number of samples | 45 | 139 | 210 | 46 |
| Median age (years) | 62 | 56 | 64 | 61 |
| Sex: F | 17 (38%) | 59 (42%) | 86 (41%) | 14 (30%) |
| Sex: M | 28 (62%) | 80 (58%) | 124 (59%) | 32 (70%) |
| Stage/metastasis type | Stage III: 45 (100%) | Stage III: 139 (100%) | General: 23 (11%) | Stage III: 46 (100%) |
| | | | In-transit: 15 (7%) | |
| | | | Local: 11 (5%) | |
| | | | Primary: 15 (7%) | |
| | | | Regional: 139 (66%) | |
| | | | NA: 7 (3%) | |
| Median survival (months) | OS: 22 | OS: 26.9 | DSS: 17.6 | OS: 65.5 |
| | RFS: 8 | RFS: 9 | | |
| Survival status (alive) | Yes: 19 (42%) | Yes: 78 (56%) | Yes: 108 (51%) | Yes: 26 (57%) |
| | No: 26 (58%) | No: 61 (44%) | No: 102 (49%) | No: 20 (43%) |
Included are the number of samples, the median age of the cohort, sex, the stage or metastasis type, the median survival time in months and survival status. OS, RFS and DSS refer to overall survival, recurrence-free survival and disease-specific survival, respectively.
Fig. 2 Schematic drawings and figures for the comparison between prediction values and re-substituted values.
a, b Schematic drawings of a non-transferable and a transferable model, respectively. By holding the samples in a validation dataset fixed, we can compare a model’s predicted values (from a model trained on samples independent of the validation data) with the re-substituted predicted values (from a model with the same configuration but trained on the validation data itself). A transferable model should produce predicted values on the same scale as the re-substituted predicted values, so each point on the scatterplot, representing a sample in the validation data, should be randomly scattered around the identity line (y = x). A non-transferable model typically exhibits bias and clustering away from the identity line. c A real-data illustration of a non-transferable Lasso model, trained on the MIA-Microarray data and validated on the TCGA melanoma data. d A real-data illustration of a transferable CPOP model, trained on the MIA-Microarray and MIA-NanoString data and validated on the TCGA melanoma data. e Scatter plots illustrating the concordance between between-data predicted hazard ratios and within-data (re-substituted) hazard ratios. We trained a CPOP penalised Cox model with recurrence-free survival times as the response by combining the MIA-Microarray and MIA-NanoString data. Each of the four panels illustrates a combination of the Lasso and the CPOP model predictions on the TCGA and the Sweden data. R denotes Pearson’s correlation coefficient.
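The identity-line comparison described in this caption can be sketched in a few lines of Python. This is an illustrative summary under stated assumptions, not the authors' evaluation code. The prediction values are made up, and the helpers `pearson_r` and `identity_line_summary` are hypothetical names introduced here. The sketch reports Pearson's correlation alongside the mean signed offset from y = x, because a high correlation alone does not show that two sets of predictions share a scale.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def identity_line_summary(predicted, resubstituted):
    """Summarise how closely points (resubstituted, predicted) follow y = x."""
    diffs = [p - r for p, r in zip(predicted, resubstituted)]
    return {
        "pearson_r": pearson_r(predicted, resubstituted),
        "mean_bias": sum(diffs) / len(diffs),  # signed offset from the identity line
        "mean_abs_dev": sum(abs(d) for d in diffs) / len(diffs),
    }


# Hypothetical re-substituted values for five validation samples.
resub = [-1.0, -0.5, 0.0, 0.5, 1.0]
# Hypothetical predictions scattered around y = x (transferable-looking).
close = [-0.9, -0.6, 0.1, 0.4, 1.0]
# Hypothetical predictions shifted by a constant (perfectly correlated, wrong scale).
shifted = [r + 2.0 for r in resub]

close_summary = identity_line_summary(close, resub)
shifted_summary = identity_line_summary(shifted, resub)
print(close_summary)
print(shifted_summary)
```

In this toy case the shifted predictions have a correlation of 1 but sit two units above the identity line, which is the kind of systematic bias the caption attributes to a non-transferable model.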
Fig. 3 Validation results of the melanoma dataset.
a Kaplan–Meier plots show a significant difference in survival probability between the predicted good (blue line) and poor (orange line) prognostic classes on the TCGA data (n = 139). The CPOP model here is trained on MIA-Microarray and MIA-NanoString based on the RFS-defined prognosis classes. b Similar evaluation for the Sweden data (n = 210). c Similar evaluation for the MIA-Validation data (n = 46) (including four imputed genes). d Network visualisation of the final CPOP model highlights the ratio-based signatures developed from applying the CPOP model on the MIA-Microarray and MIA-NanoString data. Each node of the network represents a gene, and an edge connecting two genes (nodes) represents a log-ratio feature that is present in the signature. The colour and thickness of each edge represent the sign and the magnitude of the estimated coefficient, respectively. The genes are in alphanumeric order.
Fig. 4 Validation results of the IBD dataset.
a Schematic drawing of the pre-processing steps of the IBD data. Due to reagent set changes, samples in this dataset are separated into three major batches. The top 100 differentially expressed (DE) genes are selected from the IBD2 cohort to reduce the number of features used for modelling. b Schematic drawing of the evaluation steps of the IBD data. Treating both the CPOP and Lasso methods as feature selection methods only and using the ridge regression model, we compare the predicted values against benchmark standard values where the feature selection, model training and validation data are identical. c Scatter plots comparing predicted probabilities of inflammation. The CPOP method produces predicted values that are more consistent between different batches (codesets), with higher correlation (denoted as R in the figure), and closer to the benchmark standard predicted values (denoted as ID in the figure).