Yasser El-Manzalawy, Tsung-Yu Hsieh, Manu Shivakumar, Dokyoon Kim, Vasant Honavar.
Abstract
BACKGROUND: Large-scale collaborative precision medicine initiatives (e.g., The Cancer Genome Atlas (TCGA)) are yielding rich multi-omics data. Integrative analyses of the resulting multi-omics data, such as somatic mutation, copy number alteration (CNA), DNA methylation, miRNA, gene expression, and protein expression, offer tantalizing possibilities for realizing the promise and potential of precision medicine in cancer prevention, diagnosis, and treatment by substantially improving our understanding of underlying mechanisms as well as the discovery of novel biomarkers for different types of cancers. However, such analyses present a number of challenges, including the heterogeneity and high dimensionality of omics data.
Keywords: Cancer survival prediction; Machine learning; Multi-omics data integration; Multi-view feature selection
Year: 2018 PMID: 30255801 PMCID: PMC6157248 DOI: 10.1186/s12920-018-0388-0
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
TCGA ovarian cancer omics data used in this study
| Data source | Platform | Number of samples | Number of features | Number of features with high variance |
|---|---|---|---|---|
| CNA | Affymetrix SNP 6 | 579 | 24,777 | 7,355 |
| Methylation | Illumina Infinium HumanMethylation27k | 616 | 27,579 | 6,206 |
| GE RNA-Seq | Illumina HiSeq | 308 | 30,531 | 283 |
Notations
| Symbol | Definition and Description |
|---|---|
| D = < X, y> | Labeled dataset where |
| xi | |
| g(xi, xj) | Function that returns the redundancy between two features |
| f(xi, y) | Function that returns the relevance between a feature |
| S | Indices of selected features |
| Ω | Indices of all features |
| Ω | Indices of candidate features Ω − S |
| k | Number of features to be selected |
| v | Number of views in a multi-view dataset |
| MVD = < (X1, …, Xv), y> | Labeled multi-view dataset where |
| Di = < Xi, y> | |
| Si | Indices of selected features from ith view |
| Ω | Indices of all features in |
| | Indices of candidate features Ω |
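
The notation above (relevance f, redundancy g, selected set S, all indices Ω, budget k) matches a greedy minimum-redundancy maximum-relevance (MRMR) filter. The sketch below is one minimal way to write such a filter, not the authors' implementation. It assumes the difference form of the criterion (relevance minus mean redundancy with the already selected features) and uses absolute Pearson correlation for both f and g; the function names `abs_pearson` and `mrmr` and the toy data are hypothetical.

```python
from math import sqrt

def abs_pearson(a, b):
    """Absolute Pearson correlation; 0.0 if either vector is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    if va == 0 or vb == 0:
        return 0.0
    return abs(cov) / sqrt(va * vb)

def mrmr(X, y, k, f=abs_pearson, g=abs_pearson):
    """Greedy MRMR sketch.

    X is a list of feature columns, y the label vector.
    Returns S, the indices of the selected features in selection order.
    Assumption: score = f(xi, y) - mean over j in S of g(xi, xj).
    """
    omega = list(range(len(X)))                # indices of all features
    relevance = [f(X[i], y) for i in omega]    # computed once per feature
    S = []                                     # indices of selected features
    while len(S) < min(k, len(omega)):
        candidates = [i for i in omega if i not in S]   # omega minus S

        def score(i):
            if not S:
                return relevance[i]
            redundancy = sum(g(X[i], X[j]) for j in S) / len(S)
            return relevance[i] - redundancy

        S.append(max(candidates, key=score))
    return S

# Toy data: feature 1 is an exact rescaling of feature 0,
# feature 2 is less relevant but also less redundant.
y = [0, 0, 0, 0, 1, 1, 1, 1]
X = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [2, 4, 6, 8, 10, 12, 14, 16],
    [0, 0, 1, 0, 1, 0, 1, 1],
]
selected = mrmr(X, y, k=2)
```

On this toy input the second pick is feature 2 rather than the rescaled copy of the first pick, which is the behavior the redundancy term is meant to produce.
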
Fig. 1 Two-stage framework for integrating multi-omics data. Ei refers to the enable signal for the ith view-specific filter. Fi refers to the set of features selected from the ith view using the ith filter
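
The caption does not spell out how the enable signals are generated. One plausible reading, assumed here, is that at each selection step a single view-specific filter is enabled by sampling a view from a selection probability distribution P, and the enabled filter contributes its best remaining feature. The sketch below illustrates only that first stage. For brevity each filter ranks features by a precomputed relevance score and ignores redundancy; the function name `select_multiview` and the input layout are hypothetical.

```python
import random

def select_multiview(view_scores, P, k, seed=0):
    """First-stage sketch: pick k features across views.

    view_scores[i][j]: score of feature j in view i (hypothetical input).
    P[i]: probability that the filter of view i is enabled at a step.
    Returns F, where F[i] lists the indices selected from view i.
    """
    rng = random.Random(seed)
    v = len(view_scores)
    F = [[] for _ in range(v)]
    remaining = [set(range(len(scores))) for scores in view_scores]
    total = 0
    while total < k:
        # Only views with candidates left and nonzero probability can be enabled.
        active = [i for i in range(v) if remaining[i] and P[i] > 0]
        if not active:
            break
        # Enable exactly one view-specific filter for this step.
        i = rng.choices(active, weights=[P[a] for a in active])[0]
        best = max(sorted(remaining[i]), key=lambda j: view_scores[i][j])
        F[i].append(best)
        remaining[i].remove(best)
        total += 1
    return F

# With all probability mass on view 0, every feature comes from view 0.
F = select_multiview([[0.1, 0.9, 0.5], [0.3, 0.2]], P=[1.0, 0.0], k=2)
```

In the second stage, the selected features from all views would be concatenated and passed to a single classifier such as RF, XGB, or LR.
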
Average AUC scores of RF, XGB, and LR models trained on CNA data, estimated using 10 runs of 5-fold cross validation
| # Features | RF | XGB | LR |
|---|---|---|---|
| 10 | 0.57 | 0.56 | 0.58 |
| 20 | 0.61 | 0.61 | 0.61 |
| 30 | 0.61 | 0.61 | 0.61 |
| 40 | 0.63 | 0.62 | 0.61 |
| 50 | 0.64 | 0.64 | 0.62 |
| 60 | 0.65 | 0.65 | 0.63 |
| 70 | 0.65 | 0.65 | 0.63 |
| 80 | 0.65 | 0.65 | 0.62 |
| 90 | 0.66 | 0.66 | 0.63 |
| 100 | 0.66 | 0.66 | 0.62 |
| Max | 0.66 | 0.66 | 0.63 |
| Avg. | 0.63 | 0.63 | 0.62 |
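
The AUC values in these tables are averages over 10 runs of 5-fold cross validation. The sketch below shows what such an estimate computes, using only the Python standard library. It is an illustration under stated assumptions, not the study's pipeline: folds are formed by plain shuffling rather than stratification, test folds containing a single class are skipped because AUC is undefined there, and `fit_predict` is a hypothetical callable standing in for training a model and scoring the held-out samples.

```python
import random

def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def repeated_cv_auc(X, y, fit_predict, runs=10, folds=5, seed=0):
    """Average test-fold AUC over `runs` repetitions of `folds`-fold cross validation.

    fit_predict(X_train, y_train, X_test) must return one score per test row.
    """
    rng = random.Random(seed)
    aucs = []
    for _ in range(runs):
        idx = list(range(len(y)))
        rng.shuffle(idx)                      # new random partition for each run
        for f in range(folds):
            test = idx[f::folds]              # every folds-th shuffled index
            if len({y[i] for i in test}) < 2:
                continue                      # AUC undefined for a one-class fold
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            scores = fit_predict([X[i] for i in train],
                                 [y[i] for i in train],
                                 [X[i] for i in test])
            aucs.append(auc([y[i] for i in test], scores))
    return sum(aucs) / len(aucs) if aucs else float("nan")
```

A real analysis would more likely use an established library implementation of stratified repeated cross validation; the point here is only what "10 runs of 5-fold" averages over.
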
Average AUC scores of RF, XGB, and LR models trained on methylation data, estimated using 10 runs of 5-fold cross validation
| # Features | RF | XGB | LR |
|---|---|---|---|
| 10 | 0.52 | 0.51 | 0.50 |
| 20 | 0.51 | 0.52 | 0.50 |
| 30 | 0.52 | 0.52 | 0.49 |
| 40 | 0.52 | 0.53 | 0.50 |
| 50 | 0.52 | 0.53 | 0.51 |
| 60 | 0.52 | 0.53 | 0.52 |
| 70 | 0.53 | 0.54 | 0.51 |
| 80 | 0.53 | 0.54 | 0.52 |
| 90 | 0.53 | 0.55 | 0.52 |
| 100 | 0.53 | 0.55 | 0.52 |
| Max | 0.53 | 0.55 | 0.52 |
| Avg. | 0.52 | 0.53 | 0.51 |
Average AUC scores of RF, XGB, and LR models trained on RNA-Seq data, estimated using 10 runs of 5-fold cross validation
| # Features | RF | XGB | LR |
|---|---|---|---|
| 10 | 0.58 | 0.57 | 0.59 |
| 20 | 0.60 | 0.58 | 0.61 |
| 30 | 0.61 | 0.60 | 0.63 |
| 40 | 0.62 | 0.61 | 0.64 |
| 50 | 0.62 | 0.61 | 0.65 |
| 60 | 0.63 | 0.60 | 0.66 |
| 70 | 0.63 | 0.60 | 0.64 |
| 80 | 0.64 | 0.60 | 0.65 |
| 90 | 0.63 | 0.61 | 0.65 |
| 100 | 0.64 | 0.61 | 0.65 |
| Max | 0.64 | 0.61 | 0.66 |
| Avg. | 0.62 | 0.60 | 0.64 |
Fig. 2 Performance comparisons of multi-view models using four different relevance functions for MRMR-mv and three machine learning classifiers: a) RF, b) XGB, and c) LR
Fig. 3 Relationship between the number of selected MV features and sensitivity of MV models to changes in selection probability distribution P in terms of percent relative range in AUC
Fig. 4 Performance comparisons of final multi-view models with their single-view counterparts, for three different choices of machine learning algorithms: a) RF, b) XGB, and c) LR
Top 20 features selected from CNA and RNA-Seq views
| CNA | Score | RNA-Seq | Score |
|---|---|---|---|
| TBX18 | 0.44 | OVGP1 | 0.56 |
| TSHZ2 | 0.42 | TOX3 | 0.54 |
| RN7SL781P | 0.42 | SIX3 | 0.52 |
| MAN1A2 | 0.42 | HTR3A | 0.50 |
| KIF13B | 0.40 | FLG | 0.48 |
| DKFZP667F0711 | 0.36 | SOSTDC1 | 0.48 |
| CD70 | 0.36 | EPYC | 0.48 |
| PRDM1 | 0.36 | OBP2B | 0.48 |
| ZNF471 | 0.34 | FBN3 | 0.46 |
| RPS19 | 0.34 | COL6A6 | 0.46 |
| snoU13 | 0.34 | NKAIN4 | 0.46 |
| IRX1 | 0.32 | LY6K | 0.44 |
| MIA | 0.32 | FABP6 | 0.44 |
| LYPLA1 | 0.30 | KIF1A | 0.44 |
| SHROOM3 | 0.30 | KCNJ16 | 0.44 |
| USP13 | 0.30 | PNOC | 0.42 |
| SFRP1 | 0.28 | TKTL1 | 0.42 |
| CYP11A1 | 0.28 | HLA-DRB6 | 0.42 |
| ZMYM4 | 0.28 | KRT14 | 0.42 |
| APCDD1L | 0.28 | DPP10 | 0.40 |
Fig. 5 Integrative network view of the features selected from CNA and RNA-Seq views. The genes corresponding to the selected features are highlighted by a thicker black outline. The rest of the nodes correspond to genes that are frequently altered and are known to interact with the highlighted genes (based on publicly available interaction data). The nodes are gradient color-coded according to the alteration frequency based on CNA and RNA-Seq data derived from the TCGA ovarian cancer dataset via cBioPortal