Fatemeh Vafaee1, Connie Diakos2, Michaela B Kirschner3, Glen Reid3,4, Michael Z Michael5, Lisa G Horvath4,6,7, Hamid Alinejad-Rokny8, Zhangkai Jason Cheng9,10, Zdenka Kuncic9,10, Stephen Clarke2.
Abstract
Recent advances in high-throughput technologies have provided an unprecedented opportunity to identify molecular markers of disease processes. This plethora of complex-omics data has simultaneously complicated the problem of extracting meaningful molecular signatures and opened up new opportunities for more sophisticated integrative and holistic approaches. In this era, effective integration of data-driven and knowledge-based approaches for biomarker identification has been recognised as key to improving the identification of high-performance biomarkers, and necessary for translational applications. Here, we have evaluated the role of circulating microRNA as a means of predicting the prognosis of patients with colorectal cancer, which is the second leading cause of cancer-related death worldwide. We have developed a multi-objective optimisation method that effectively integrates a data-driven approach with the knowledge obtained from the microRNA-mediated regulatory network to identify robust plasma microRNA signatures which are reliable in terms of predictive power as well as functional relevance. The proposed multi-objective framework has the capacity to adjust for conflicting biomarker objectives and to incorporate heterogeneous information facilitating systems approaches to biomarker discovery. We have found a prognostic signature of colorectal cancer comprising 11 circulating microRNAs. The identified signature predicts the patients' survival outcome and targets pathways underlying colorectal cancer progression. The altered expression of the identified microRNAs was confirmed in an independent public data set of plasma samples of patients in early stage vs advanced colorectal cancer. Furthermore, the generality of the proposed method was demonstrated across three publicly available miRNA data sets associated with biomarker studies in other diseases.
Year: 2018 PMID: 29872543 PMCID: PMC5981448 DOI: 10.1038/s41540-018-0056-1
Source DB: PubMed Journal: NPJ Syst Biol Appl ISSN: 2056-7189
Fig. 1 Outline of the method. a The construction steps of the miRNA-mediated regulatory network: (1) miRNA target genes (TGs) that are either validated experimentally or predicted by two different data sets were retrieved using multiMiR, an R package providing access to 11 publicly available data sets. Transcription factor (TF) targets were retrieved from the ORTI database, which compiles validated mammalian TF-TG interactions from six public data sets as well as the literature. The miRNA-mediated regulatory network was constructed using a recursive algorithm described in Supplementary Figure S3. (2) The network was then annotated using 339 CRC-associated genes identified by MalaCards; 35 ‘elite’ genes with strong causal associations with CRC progression were ranked ‘1’ and the rest of the CRC genes were ranked ‘2’. (3) Using the annotated network, a functional relevance (FR) score was calculated for each miRNA (using Eq. (1)) and a look-up table was returned to be used in the subsequent biomarker discovery. b FR calculation on an example network. c Schematic view of the proposed multi-objective optimisation-based biomarker discovery workflow: The pre-processed samples were partitioned into validation and discovery sets using fivefold cross-validation. The multi-objective optimiser was run on the discovery set, where the objectives are the prediction errors and averaged FR scores of the population of putative signatures. Optimal miRNA signatures (i.e., Pareto front solutions) and their corresponding predictive models were then used to classify test samples and the performance measures were reported. The whole process was repeated 50 times to account for random partitioning of samples and the average performance measures were reported (Fig. 3)
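The partitioning step in panel c can be sketched as repeated fivefold cross-validation. This is a minimal illustration, not the authors' implementation: the function name `repeated_kfold`, the fixed seed and the unstratified shuffling are assumptions, and the sample count of 75 is taken from the baseline patient table.

```python
import random

def repeated_kfold(n_samples, k=5, repeats=50, seed=0):
    """Yield (repeat, fold, discovery_idx, validation_idx) tuples.

    Hypothetical helper: plain (unstratified) shuffling is an assumption;
    the source does not say whether the folds were stratified.
    """
    rng = random.Random(seed)
    for r in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # new random partition for every repeat
        folds = [idx[i::k] for i in range(k)]  # k near-equal folds
        for f, validation in enumerate(folds):
            discovery = [i for j, fold in enumerate(folds) if j != f for i in fold]
            yield r, f, sorted(discovery), sorted(validation)

# 75 samples, fivefold CV, 50 repeats -> 250 discovery/validation splits
splits = list(repeated_kfold(n_samples=75))
```

In each split the optimiser would see only the discovery indices, and the held-out validation indices would be used for reporting performance.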
Fig. 3 Performance comparison with relevant approaches with inherent feature selection. The performance of the proposed multi-objective optimiser was compared with relevant methods with inherent feature selection (i.e., single-objective optimiser, Lasso, guided RRF and penalised SVM). a The accuracy, specificity, sensitivity and functional relevance score were averaged over 50 runs of sample partitioning using fivefold cross-validation. b Sizes of the identified signatures, or the number of features selected by each method over 50 independent runs, are shown as box plots. c As a measure of signature stability, the Jaccard index was computed for all pairs of signatures identified by each of the compared methods across 50 runs and the average values were reported. In all bar charts, error bars show standard deviations and multi-objective optimiser bars are marked by ‘*’ when the proposed method significantly outperforms the others (Wilcoxon test p values < 0.001)
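The stability measure in panel c, the average Jaccard index over all pairs of signatures, can be sketched as follows. The function names and the miRNA labels are placeholders chosen for illustration, and treating two empty signatures as identical is an assumption.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two feature sets."""
    a, b = set(a), set(b)
    union = a | b
    # Assumption: two empty signatures are treated as identical.
    return len(a & b) / len(union) if union else 1.0

def mean_pairwise_jaccard(signatures):
    """Average Jaccard index over all unordered pairs of signatures."""
    pairs = list(combinations(signatures, 2))
    if not pairs:
        raise ValueError("need at least two signatures")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Placeholder signatures from three hypothetical runs
runs = [{"miR-A", "miR-B", "miR-C"}, {"miR-A", "miR-B"}, {"miR-A", "miR-B", "miR-C"}]
stability = mean_pairwise_jaccard(runs)
```

A value of 1 means every run selected the same signature; values near 0 mean the runs share almost no features.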
Baseline patient characteristics
| Characteristics | Description | Notes |
|---|---|---|
| Gender (F/M) | 30/45 | F: Female, M: Male |
| Age | 59 years | Average age at enrolment |
| Survival (mean ± std) | 20.98 ± 11.67 months | Survival times for 53 patients were not reported because they were alive at the end of the follow-up; their prognostic status was considered ‘long survival’. |
| Tumour site (C/R/RS) | 45/24/6 | C: Colon, R: Rectum, RS: Rectosigmoid |
| Chemotherapy regime | FOLFOX | All 75 patients received FOLFOX-based chemotherapy |
Fig. 2 Performance comparison of different classifiers. The predictive performance of three classifiers, namely AdaBoost (with decision trees as weak learners), Random Forest (RF) and Support Vector Machine (SVM), was assessed. a Predictive features were selected randomly; the null distributions were set using 500 sets of randomly chosen miRNAs. The distributions of the classifiers’ accuracy, specificity and sensitivity (with ‘long survival’ as the positive class) as well as functional relevance scores were plotted. Mean values are marked on the density plots. b The predictive features were the set of differentially expressed genes (KS test, p value < 0.05); error bars show standard deviations
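The three performance measures follow the usual confusion-matrix definitions, with ‘long survival’ as the positive class. The sketch below is illustrative only: the function name, the label strings `"long"` and `"short"`, and the toy predictions are assumptions.

```python
def classification_metrics(y_true, y_pred, positive="long"):
    """Accuracy, sensitivity and specificity for a binary outcome.

    Sensitivity = TP / (TP + FN), specificity = TN / (TN + FP).
    Returns None for a measure whose denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else None,
        "sensitivity": tp / (tp + fn) if tp + fn else None,
        "specificity": tn / (tn + fp) if tn + fp else None,
    }

# Toy labels, not data from the study
y_true = ["long", "long", "long", "short", "short"]
y_pred = ["long", "long", "short", "short", "long"]
metrics = classification_metrics(y_true, y_pred)
```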
Fig. 4 Identified plasma miRNA signature of CRC prognosis. A prognostic signature of 11 plasma miRNAs was identified using the proposed network-based multi-objective optimisation approach. a Boxplots represent the distributions of miRNA expression across short and long survival samples. b The expression values of the identified miRNAs were examined in an independent public data set of qPCR miRNA profiles obtained from plasma of CRC patients at early or late cancer stages (accession no: GSE67075). Early-stage vs advanced cancer was compared using non-parametric Kolmogorov-Smirnov hypothesis testing. The bar next to each miRNA shows the achieved p value scaled by -log10 to improve visibility. ‘NA’ indicates that the corresponding miRNA was not profiled (or was filtered out) in the data set; ‘*’ marks differentially expressed miRNAs based on the p value cut-off of 0.1. c List of important overrepresented KEGG pathways and their corresponding -log10 scaled p values, related to CRC mechanisms and to inflammation, which is an important risk factor for the development of colon cancer
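The bar annotation in panel b can be sketched as a small helper that scales each p value by -log10 and flags it against the 0.1 cut-off. The miRNA names and p values below are invented placeholders, and using a strict `<` comparison at the cut-off is an assumption.

```python
import math

def scale_p_values(p_values, cutoff=0.1):
    """Return (name, -log10(p), flag) rows.

    flag is "*" below the cut-off, "NA" when the miRNA was not profiled
    (p is None), and "" otherwise. Assumes every reported p is > 0.
    """
    rows = []
    for name, p in p_values.items():
        if p is None:
            rows.append((name, None, "NA"))
        else:
            rows.append((name, -math.log10(p), "*" if p < cutoff else ""))
    return rows

# Placeholder values, not results from GSE67075
rows = scale_p_values({"miR-X": 0.01, "miR-Y": 0.5, "miR-Z": None})
```

On this scale the 0.1 cut-off corresponds to a bar height of 1, so any starred bar is taller than 1.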
Fig. 5 Performance comparison over three other miRNA data sets. The proposed multi-objective optimiser and four benchmark methods were used to identify signatures of disease phenotypes in three publicly available data sets. The performance measures (i.e., accuracy, sensitivity and specificity over test samples, functional relevance (FR), signature size and stability based on the Jaccard index) of the compared methods were aggregated across 25 independent runs (five runs of fivefold CV). Bar charts represent the average values and error bars show standard deviations. a GSE63108: circulating exosomal miRNA expression profiles in oesophageal adenocarcinoma and normal samples. b GSE76260: miRNA expression profiling in prostate cancer tumours vs non-neoplastic tissues. The bi-objective GA searches for signatures that simultaneously minimise the error rate and the inverse of FR. The tri-objective GA minimises error rate, 1/FR and signature size simultaneously. Increasing the number of objectives increases the number of Pareto front solutions. In the tri-objective GA, a Pareto front solution performing better with respect to the first objective was chosen in each run. c GSE70754: miRNA expression profiles in locally advanced breast cancer tumour vs normal tissues
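The Pareto front idea behind the bi-objective and tri-objective GA can be sketched with a plain dominance filter. This is not the authors' genetic algorithm: it only shows how non-dominated solutions are extracted once candidate signatures have been scored, and the objective values below are made up.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better
    on at least one (all objectives are minimised)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Return the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Made-up candidates scored as (error rate, 1/FR, signature size)
tri = [(0.20, 0.50, 12), (0.25, 0.40, 8), (0.30, 0.60, 5), (0.20, 0.70, 9)]
bi = [p[:2] for p in tri]  # drop signature size for the bi-objective case

bi_front = pareto_front(bi)    # two non-dominated solutions
tri_front = pareto_front(tri)  # all four become non-dominated

# Pick the front solution that is best on the first objective (error rate)
chosen = min(bi_front, key=lambda p: p[0])
```

With these toy values the front grows from two to four solutions when signature size is added as a third objective, which matches the caption's remark that more objectives yield more Pareto front solutions.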