Integrative Network Fusion: A Multi-Omics Approach in Molecular Profiling.

Marco Chierici^1, Nicole Bussola^1,2, Alessia Marcolini^1, Margherita Francescatto^1,3, Alessandro Zandonà^4, Lucia Trastulla^5, Claudio Agostinelli^2, Giuseppe Jurman^1, Cesare Furlanello^1,6.

Abstract

Recent technological advances and international efforts, such as The Cancer Genome Atlas (TCGA), have made available several pan-cancer datasets encompassing multiple omics layers with detailed clinical information in large collections of samples. The need has thus arisen for the development of computational methods aimed at improving cancer subtyping and biomarker identification from multi-modal data. Here we apply the Integrative Network Fusion (INF) pipeline, which combines multiple omics layers exploiting Similarity Network Fusion (SNF) within a machine learning predictive framework. INF includes a feature ranking scheme (rSNF) on SNF-integrated features, used by a classifier over juxtaposed multi-omics features (juXT). In particular, we show instances of INF implementing Random Forest (RF) and linear Support Vector Machine (LSVM) as the classifier, and two baseline RF and LSVM models are also trained on juXT. A compact RF model, called rSNFi, trained on the intersection of top-ranked biomarkers from the two approaches juXT and rSNF is finally derived. All the classifiers are run in a 10 × 5-fold cross-validation schema to warrant reproducibility, following the guidelines for an unbiased Data Analysis Plan by the US FDA-led initiatives MAQC/SEQC. INF is demonstrated on four classification tasks on three multi-modal TCGA oncogenomics datasets. Gene expression, protein expression and copy number variants are used to predict estrogen receptor status (BRCA-ER, N = 381) and breast invasive carcinoma subtypes (BRCA-subtypes, N = 305), while gene expression, miRNA expression and methylation data are used as predictor layers for acute myeloid leukemia and renal clear cell carcinoma survival (AML-OS, N = 157; KIRC-OS, N = 181). In test, INF achieved similar Matthews Correlation Coefficient (MCC) values and 97% to 83% smaller feature sizes (FS), compared with juXT for BRCA-ER (MCC: 0.83 vs. 0.80; FS: 56 vs. 1801) and BRCA-subtypes (0.84 vs. 0.80; 302 vs. 1801), improving KIRC-OS performance (0.38 vs. 0.31; 111 vs. 2319). INF predictions are generally more accurate in test than one-dimensional omics models, with smaller signatures too, where transcriptomics consistently plays the leading role. Overall, the INF framework effectively integrates multiple data levels in oncogenomics classification tasks, improving over the performance of single layers alone and naive juxtaposition, and provides compact signature sizes.

Keywords: classification; multi-omics; network; oncogenomics; predictive modeling

Year: 2020 PMID: 32714870 PMCID: PMC7340129 DOI: 10.3389/fonc.2020.01065

Source DB: PubMed Journal: Front Oncol ISSN: 2234-943X Impact factor: 6.244

1. Introduction

The challenge of integrating multi-omics data is as old as bioinformatics itself (1, 2), but, despite the wide literature, it remains an open issue nowadays, even worth being funded by major institutions. This study introduces Integrative Network Fusion (INF), a reproducible network-based framework for high-throughput omics data integration that leverages machine learning models to extract multi-omics predictive biomarkers. Originally conceptualized and tested on multi-omics metagenomics data in an early preliminary version (3, 4), INF combines the signatures retrieved from both the early-integration approach of variable juxtaposition (juXT) and an intermediate-integration approach [SNF, (5)], to find the optimal set of predictive features. In particular, first a set of top-ranked features is extracted by juXT by a classifier, here Random Forest (RF) and linear Support Vector Machine (LSVM). Then, a feature ranking scheme (rSNF) is computed on SNF-integrated features and finally a RF model (rSNFi) is trained on the intersection of two sets of top-ranked features from juXT and rSNF, obtaining an approach that effectively integrates multiple omics layers and provides compact predictive signatures. Selection bias and data-leakage effects are controlled by performing the experiments within a rigorous Data Analysis Plan (DAP) to warrant reproducibility, following the guidelines of the US FDA-led initiatives MAQC/SEQC (6–8). In particular, to alleviate the computational burden of the full DAP pipeline, an approximated DAP is designed to lighten computing without significantly affecting the results. Further, experiments are run on samples with randomly shuffled labels as a sanity check vs. overfitting effects and, finally, INF robustness is verified by testing on different train/test splits. 
We test INF on three datasets retrieved from the TCGA repository, to predict either the estrogen receptor status (ER) or the cancer subtype on the breast invasive carcinoma (BRCA) dataset, and to predict the overall survival (OS) on the kidney renal clear cell carcinoma (KIRC) and acute myeloid leukemia (AML) datasets. Overall, INF improves over the performance of single layers and naive juxtaposition on all four oncogenomics tasks, extracting a biologically meaningful compact set of predictive biomarkers. Notably, the transcriptomics layer is prevalent inside the inferred INF signatures, consistent with published findings (9). The INF framework is currently designed to integrate an arbitrary number of one-dimensional omics layers. We plan to further extend the framework by enabling the integration of histopathological features extracted from whole slide images (10) or deep features from radiological images (11) extracted by deep neural network architectures, carefully addressing all potential caveats (12).

2. Materials and Methods

2.1. Data

Three multi-modal cancer datasets generated by The Cancer Genome Atlas (TCGA) Research Network (https://www.cancer.gov/tcga) and four classification tasks are considered in this study. Protein expression (prot), gene expression (gene), and copy number variants (cnv) are used to predict breast invasive carcinoma (BRCA) estrogen receptor status (0: negative; 1: positive) and subtypes (luminal A, luminal B, basal-like, HER2-enriched). Methylation (meth), gene expression (gene), and microRNA expression (mirna) are used to predict acute myeloid leukemia (AML) and kidney renal clear cell carcinoma (KIRC) overall survival (0: alive; 1: deceased). The number of samples and features for each omic layer and classification task are detailed in Table 1; class balance, split by dataset, is reported in Table 2.

Table 1

Data summary.

Dataset-task	#Samples	Layers (#features)
BRCA-ER	381	gene (17814), cnv (18050), prot (142)
BRCA-subtypes	305	gene (17814), cnv (18050), prot (142)
AML-OS	157	gene (10265), meth (2500), mirna (352)
KIRC-OS	181	gene (10265), meth (2500), mirna (484)
Synthetic-ST	380	layer1 (100), layer2 (50), layer3 (250)

BRCA, breast invasive carcinoma; AML, acute myeloid leukemia; KIRC, kidney renal clear cell carcinoma; gene, gene expression; cnv, copy number variants; prot, protein expression; meth, methylation; mirna, microRNA expression; ER, estrogen receptor; subtypes, breast cancer subtypes; OS, overall survival; ST, synthetic target.

Table 2

Class balance.

Dataset-task	Labels (#samples)
BRCA-ER	Negative (95), Positive (286)
BRCA-subtypes	LuminalA (170), LuminalB (102), Basal-like (81), HER2-enriched (48)
AML-OS	Dead (101), Alive (56)
KIRC-OS	Dead (133), Alive (48)

BRCA, breast invasive carcinoma; AML, acute myeloid leukemia; KIRC, kidney renal clear cell carcinoma; ER, estrogen receptor; subtypes, breast cancer subtypes; OS, overall survival.

For AML (13) and KIRC (14), gene expression is profiled using the Illumina HiSeq2000 and quantified as log2-transformed RSEM normalized counts; miRNA mature strand expression is profiled using the Illumina Genome Analyzer and quantified as reads per million miRNA mapped; and methylation is assessed by Illumina Human Methylation 450K and expressed as beta values. For BRCA (15), gene expression is profiled with Agilent 244K custom gene expression microarrays; protein expression is assessed by reverse phase protein arrays; copy number profiles are measured using Affymetrix Genome-Wide Human SNP Array 6.0 platform, copy number variants are segmented by the TCGA Firehose pipeline using GISTIC2 method, and then mapped to genes. The original data is publicly accessible on the National Cancer Institute GDC Data Portal (https://portal.gdc.cancer.gov/) and the Broad GDAC Firehose (https://gdac.broadinstitute.org/), where further details on data generation can be found. The data was retrieved in December, 2019 and January, 2020 using the RTCGA R library (16). Furthermore, the INF pipeline has been tested on a synthetic dataset with 380 observations in two classes (70% class 1 and 30% class 2, defining the synthetic target ST), 3 pseudo-omics layers, and 400 features (layer 1: 100; layer 2: 50; layer 3: 250). The dataset is generated in-house using scikit-learn's make_classification function with the arguments shuffle=False and flip_y=0. 
The number of informative features and the difficulty of the task were set on a per-layer basis, as summarized in Table 3.

Table 3

Synthetic data summary for each simulated layer.

Layer	# Features	# Informative features	Multiplicative factor	Class separation	Random state
Layer 1	100	10	Default	1.0	1
Layer 2	50	5	Default	1.2	2
Layer 3	250	25	10	0.8	3

Multiplicative factor, class separation, and random state refer to the corresponding parameters of scikit-learn's make_classification function.
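As an illustration of how such a dataset could be generated, the following minimal sketch calls make_classification once per layer with the values of Table 3. It is not the authors' script: mapping the "multiplicative factor" to the `scale` argument, reading "Default" as scale=1.0, and the helper name `make_layers` are assumptions of this example. With shuffle=False and flip_y=0 the label vector depends only on the sample size and class weights, so the three layers share the same target, which the code checks rather than assumes.

```python
import numpy as np
from sklearn.datasets import make_classification

# Per-layer settings from Table 3. Mapping "multiplicative factor" to `scale`
# (with "Default" read as 1.0) is an assumption of this sketch.
LAYERS = [
    dict(n_features=100, n_informative=10, scale=1.0, class_sep=1.0, random_state=1),
    dict(n_features=50, n_informative=5, scale=1.0, class_sep=1.2, random_state=2),
    dict(n_features=250, n_informative=25, scale=10.0, class_sep=0.8, random_state=3),
]


def make_layers(n_samples=380, weights=(0.7, 0.3)):
    """Return one feature matrix and one label vector per pseudo-omics layer."""
    matrices, labels = [], []
    for params in LAYERS:
        X, y = make_classification(
            n_samples=n_samples,
            weights=list(weights),
            shuffle=False,  # keep samples in generation order
            flip_y=0,       # no label noise
            **params,
        )
        matrices.append(X)
        labels.append(y)
    return matrices, labels


matrices, labels = make_layers()
y = labels[0]
# Verify that every layer produced the same labels before sharing the target.
same_target = all(np.array_equal(y, other) for other in labels)
juxtaposed = np.hstack(matrices)  # column-wise concatenation of the layers
print(juxtaposed.shape, same_target, y.mean())
```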

2.2. In silico Workflow

The INF pipeline integrates two or more omics layers, e.g., gene expression, protein expression, or methylation, in a machine learning framework for improved patient classification and biomarker identification in cancer. The core consists of three main components, structured as in Figure 1, managing the integration of the omics layers and their predictive modeling. A baseline integration method (juXT) is first considered by training a Random Forest (RF) (17) or a linear Support Vector Machine (LSVM) (18) classifier on juxtaposed multi-omics data, ranking features by ANOVA F-value. Secondly, the multi-omics features are integrated by Similarity Network Fusion (SNF) (5), a method that computes a sample similarity network for each data type and fuses them into one network. INF introduces a novel feature ranking scheme (rSNF) that sorts multi-omics features according to their contribution to the SNF-fused network structure. A RF or LSVM classifier is trained on the juxtaposed multi-omics data, ranking features by rSNF. A compact RF model (rSNFi) is finally trained on the juxtaposed dataset restricted on the intersection of top-ranked biomarkers from juXT and rSNF.

Figure 1

Graphical representation of the INF workflow for N omics datasets with K phenotypes. A first RF or LSVM classifier is trained on the juxtaposed data, ranking features by ANOVA F-value (juXT). The data sets are then integrated by Similarity Network Fusion, the features are ranked by rSNF and a RF or LSVM model is developed on the juxtaposed dataset with the rSNF feature ranking (rSNF). Finally, a RF or LSVM classifier is trained on the juxtaposed dataset restricted to the intersection of juXT and rSNF top discriminant feature lists (rSNFi). The classifier is either RF or LSVM throughout the INF workflow. All the predictive models are developed within the DAP described in the methods and graphically represented in Figure 2. The alternative and mutually exclusive paths A and F are followed by the “accelerated DAP” and the “full DAP” procedures, respectively (see section 2).

Figure 2

Diagram of the Data Analysis Plan (DAP), originally developed within the FDA-led MAQC/SEQC-II initiatives. If the training set labels are stochastically shuffled beforehand, the DAP runs in “random labels” mode as a sanity check to ensure that the procedure is not affected by systematic bias.

2.3. Omics Integration

In a comparative review of scientific literature, SNF (5) emerged as one of the most reliable alternatives to simple juxtaposition-based integration. SNF is a non-Bayesian network-based method that can be divided into two main steps: the first step builds a sample-similarity network for each omics dataset, where nodes represent samples and edges encode a scaled exponential Euclidean distance kernel computed on each pair of samples; the second step implements a non-linear combination of these networks into a single similarity network through an iterative procedure. The multi-omics datasets are first converted into graphs, and for each graph two matrices are computed: a patient pairwise similarity matrix (“status matrix”), and a matrix with similarity of each patient to the K most similar patients, through K-nearest neighbors (“local affinity matrix”). At each iteration, the status matrix is updated through the local affinity matrix, generating two parallel interchanging processes. The status matrices are finally fused together into a single network. Spectral clustering is performed on the fused network, in order to identify sub-communities of samples, potentially reflecting phenotypes. The clustering performance is evaluated with respect to a ground truth, i.e., the real phenotype each sample belongs to, by the Normalized Mutual Information (NMI) score. SNF integrates multiple omics datasets into a single comprehensive network in the space of samples rather than measurements (e.g., gene expression values). This work proposes multi-omics integration as an approach to identify robust biomarkers of samples phenotypes or cancer subtypes (e.g., survival status vs. breast cancer subtyping); consequently, it is necessary to extract measurements information from the SNF-fused network of samples. To this aim, we extended SNF by implementing rSNF (ranked SNF), a feature-ranking scheme based on SNF-fused network clustering. 
In detail, a patient network W is built for each feature f, based on f alone, and spectral clustering is performed on it. Then, NMI score is computed comparing the samples clusters found inside W with those in the fused network; the higher the score, the more similar the clustering between the fused network and W. Thus, each feature f is associated to a consistency score, ranking all multi-omics features with respect to their relative contribution to the whole network structure. The entire procedure of similarity networks inference and fusion relies on two hyperparameters: α, the scaling variance in the scaled exponential similarity kernel used for similarity networks construction, and K, the number of nearest neighbors in sparse kernel and scaled exponential similarity kernel construction. While the original method (5) assigned fixed values to α and K, in this study the optimal hyperparameters are chosen among the grids α = {0.3, 0.35, 0.4, 0.45, …, 0.8} and K = {i∈ℕ, 10 ≤ i ≤ 30} in a 10 × 5-fold cross-validation schema.
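The per-feature ranking idea can be sketched in Python as follows. This is a simplified illustration and not the authors' R implementation: the function names are hypothetical, the kernel follows the scaled exponential similarity form described for SNF, and, to keep the example short, the fused network is replaced by the plain average of the per-layer affinity matrices, which is only a stand-in for SNF's iterative fusion. The toy data and the values K=20, alpha=0.5 are likewise assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import normalized_mutual_info_score, pairwise_distances


def affinity(X, K=20, alpha=0.5):
    """Scaled exponential similarity kernel between the rows (samples) of X."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # a single feature becomes a one-column matrix
    d = pairwise_distances(X)  # Euclidean distances between samples
    # Mean distance of each sample to its K nearest neighbours (self excluded).
    knn_mean = np.sort(d, axis=1)[:, 1:K + 1].mean(axis=1)
    eps = (knn_mean[:, None] + knn_mean[None, :] + d) / 3.0
    W = np.exp(-d ** 2 / (alpha * eps + 1e-12))
    return (W + W.T) / 2.0  # enforce exact symmetry


def spectral_labels(W, n_clusters, seed=0):
    """Cluster samples from a precomputed similarity matrix."""
    model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                               random_state=seed)
    return model.fit_predict(W)


def rank_features(X, W_ref, n_clusters, K=20, alpha=0.5):
    """Score each feature by the NMI between its own clustering and the reference one."""
    ref = spectral_labels(W_ref, n_clusters)
    scores = np.array([
        normalized_mutual_info_score(
            ref, spectral_labels(affinity(X[:, j], K, alpha), n_clusters))
        for j in range(X.shape[1])
    ])
    return np.argsort(-scores), scores


# Toy data: two layers, 60 samples, one class-dependent feature per layer.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 30)
layer_a = rng.normal(size=(60, 4))
layer_a[:, 0] += 6 * y
layer_b = rng.normal(size=(60, 3))
layer_b[:, 0] += 6 * y

# Stand-in for the fused network: average of per-layer affinities (NOT SNF fusion).
W_ref = (affinity(layer_a) + affinity(layer_b)) / 2.0
X = np.hstack([layer_a, layer_b])
order, scores = rank_features(X, W_ref, n_clusters=2)
print(order, np.round(scores, 2))
```

In the actual procedure the reference clustering would come from the SNF-fused network, and α and K would be chosen on the grids given above.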

2.4. Predictive Profiling

To ensure the reproducibility of results and limit overfitting, the development of classification models is performed inside a Data Analysis Plan (DAP) (Figure 2), following the guidelines derived by the U.S. Food and Drug Administration MAQC/SEQC studies (6, 19). Data are split into a training set (TR) and two non-overlapping test sets (TS, TS2), preserving the original proportion of patient phenotypes (classes). The TR/TS/TS2 partitions are 50/30/20% of the entire data set, respectively. The data splitting procedure is repeated 10 times so as to obtain 10 different TR/TS/TS2 splits. Predictive models are trained on TR and validated on TS for juXT and rSNF; in the case of rSNFi, the models are trained on TS and validated on TS2 to avoid information leakage due to using the same data both for feature selection and model training (see Figure 3). For each split, Random Forest (RF) or linear kernel Support Vector Machine (LSVM) classifiers are trained on the training partition within a stratified 10 × 5-fold cross-validation (10 × 5-CV). The model performance is assessed in terms of average precision, recall and Matthews Correlation Coefficient (MCC) (20, 21). The MCC is generally regarded as a balanced measure of accuracy and precision that can be used both in binary and multiclass problems (22, 23) and even when classes are imbalanced (24). MCC lies in [−1, 1], with 1 meaning perfect prediction, −1 inverse prediction and 0 random guess. For binary classification tasks, MCC is calculated on true and predicted labels considering true positive (TP), true negative (TN), false positive (FP) and false negative (FN) values, as in the following:
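$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

This is the standard binary definition of the MCC. As a small numerical illustration, the sketch below evaluates it from the four confusion-matrix counts. The function name `mcc` and the convention of returning 0 when the denominator vanishes are assumptions of this example, not taken from the INF code.

```python
import math


def mcc(tp, tn, fp, fn):
    """Binary Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention assumed here for degenerate confusion matrices
    return (tp * tn - fp * fn) / denom


perfect = mcc(tp=10, tn=10, fp=0, fn=0)   # every sample classified correctly
inverse = mcc(tp=0, tn=0, fp=10, fn=10)   # every sample classified wrongly
chance = mcc(tp=5, tn=5, fp=5, fn=5)      # predictions unrelated to the labels
print(perfect, inverse, chance)
```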

Figure 3

Data splitting procedure. To avoid information leakage due to the use of the same data both for feature selection and model training, we considered different train and test sets according to the integration scheme. In particular, each data set is split into three non-overlapping partitions (TR/TS/TS2), corresponding to the 50/30/20% of the entire data set, respectively. The TR/TS/TS2 partitions preserve the original proportion of patient phenotypes. Predictive models for juXT and rSNF are trained on TR and validated on TS, while for rSNFi the train set is TS (with features restricted to the intersected biomarkers of juXT and rSNF) and TS2 the test set.
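A stratified 50/30/20 partition of this kind can be obtained with two successive calls to scikit-learn's train_test_split, as in the minimal sketch below. The helper name `split_tr_ts_ts2`, the toy label vector, and the use of the split index as random seed are assumptions of this example and may differ from the INF code.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_tr_ts_ts2(y, seed):
    """Return index arrays for a stratified 50/30/20 TR/TS/TS2 partition."""
    idx = np.arange(len(y))
    # First cut: 50% training, 50% held out.
    tr, rest = train_test_split(idx, train_size=0.5, stratify=y, random_state=seed)
    # Second cut: 60% of the held-out half is TS (30% overall), the rest is TS2 (20%).
    ts, ts2 = train_test_split(rest, train_size=0.6, stratify=y[rest],
                               random_state=seed)
    return tr, ts, ts2


# Toy labels with a 30/70 class balance; ten repeated splits as in the DAP.
y = np.array([0] * 60 + [1] * 140)
splits = [split_tr_ts_ts2(y, seed) for seed in range(10)]
tr, ts, ts2 = splits[0]
print(len(tr), len(ts), len(ts2), y[tr].mean(), y[ts].mean(), y[ts2].mean())
```

For juXT and rSNF the models would then be fitted on `tr` and evaluated on `ts`, while for rSNFi they would be fitted on `ts` and evaluated on `ts2`.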

At each CV round, features are ranked either by ANOVA F-value (for juXT, rSNFi) or by the rSNF ranking (see section 2.3) and different classification models are trained for increasing numbers of ranked features, namely 5, 10, 25, 50, 75, and 100% of the total features. A unified list of top-ranked features is then obtained by Borda aggregation of all the ranked CV lists (25, 26). The best model is later retrained on the whole training set restricted to the features yielding the maximum MCC in CV, and validated on the test partition. A global list of top-ranked features is derived for juXT, rSNF, and rSNFi by Borda aggregation of the Borda lists of each TR/TS split (Borda of Bordas, “BoB”). The signatures for juXT, rSNF, and rSNFi are defined by the top N features of the corresponding BoB lists, with N being the median size of top features across all experiments. In the “full” version of the DAP (fDAP), described above, the rSNF ranking is performed at each CV round on the training portion of the data. 
Since this procedure is quite demanding in terms of computational time, even if parallelized (≈9 features/min), we devised an “accelerated” version of the DAP (aDAP), where the rSNF ranking is precomputed on the whole TR data and used as is at each CV round. We assessed the fDAP vs. aDAP performance on the synthetic dataset as well as BRCA-ER and BRCA-subtypes by comparing the overall metrics and measuring the dissimilarity of the rSNF BoB of the two DAPs by the Canberra distance (25). RF models are trained using 500 trees, measuring the quality of a split as mean decrease in the Gini impurity index (17); the regularization parameter C of LSVM models is tuned over a grid within a 10× stratified Monte Carlo cross-validation (50% training/validation proportion). Results for RF models are summarized in Table 4, while LSVM models performance is detailed in the Supplementary Tables BRCA-ER_LSVM, KIRC-OS_LSVM.
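The Borda aggregation step described above, and its nesting into a “Borda of Bordas”, can be illustrated with a generic mean-position Borda count. The sketch assumes that every list ranks the same set of features and breaks ties alphabetically; the exact variant used in (25, 26) may differ, and the feature names are placeholders.

```python
def borda(rankings):
    """Aggregate ranked lists (best first) by mean position; ties broken by name."""
    positions = {}
    for ranking in rankings:
        for pos, feature in enumerate(ranking):
            positions.setdefault(feature, []).append(pos)
    n_lists = len(rankings)
    incomplete = [f for f, p in positions.items() if len(p) != n_lists]
    if incomplete:
        raise ValueError(f"features missing from some lists: {incomplete}")
    return sorted(positions, key=lambda f: (sum(positions[f]) / n_lists, f))


# Ranked lists from the CV rounds of two hypothetical TR/TS splits.
split_1 = [["g1", "g2", "g3", "g4"], ["g2", "g1", "g3", "g4"], ["g1", "g3", "g2", "g4"]]
split_2 = [["g2", "g1", "g4", "g3"], ["g2", "g3", "g1", "g4"]]

borda_1 = borda(split_1)          # one Borda list per split
borda_2 = borda(split_2)
bob = borda([borda_1, borda_2])   # Borda of Bordas across splits
print(borda_1, borda_2, bob)
```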

Table 4

Summarized best predictive performances for each classification task using RF model and three omics layers.

Task	Method	MCC_cv (CI)	MCC_ts (CI)	PREC_cv (CI)	PREC_ts (CI)	REC_cv (CI)	REC_ts (CI)	Nf
BRCA-ER	juXT	0.785 (0.776, 0.795)	0.797 (0.778, 0.819)	0.935 (0.932, 0.938)	0.946 (0.935, 0.957)	0.962 (0.959, 0.965)	0.955 (0.949, 0.962)	1801
	rSNF	0.792 (0.782, 0.801)	0.804 (0.779, 0.830)	0.938 (0.935, 0.941)	0.947 (0.934, 0.961)	0.961 (0.958, 0.965)	0.958 (0.949, 0.966)	1801
	rSNFi	0.820 (0.808, 0.831)	0.830 (0.803, 0.857)	0.955 (0.951, 0.959)	0.951 (0.939, 0.962)	0.956 (0.952, 0.960)	0.967 (0.956, 0.977)	55.5
BRCA-subtypes	juXT	0.778 (0.771, 0.785)	0.795 (0.771, 0.817)	-	-	-	-	1801
	rSNF	0.769 (0.762, 0.777)	0.811 (0.787, 0.835)	-	-	-	-	1801
	rSNFi	0.788 (0.778, 0.798)	0.838 (0.794, 0.879)	-	-	-	-	301.5
KIRC-OS	juXT	0.266 (0.243, 0.289)	0.305 (0.229, 0.382)	0.540 (0.509, 0.570)	0.579 (0.494, 0.664)	0.299 (0.280, 0.317)	0.343 (0.300, 0.393)	2319
	rSNF	0.253 (0.230, 0.276)	0.274 (0.189, 0.348)	0.539 (0.505, 0.571)	0.628 (0.507, 0.739)	0.253 (0.235, 0.270)	0.257 (0.200, 0.314)	3313
	rSNFi	0.268 (0.239, 0.298)	0.378 (0.288, 0.464)	0.485 (0.449, 0.521)	0.594 (0.512, 0.668)	0.321 (0.296, 0.347)	0.490 (0.380, 0.600)	111
AML-OS	juXT	0.141 (0.120, 0.163)	0.223 (0.146, 0.307)	0.675 (0.669, 0.681)	0.704 (0.682, 0.725)	0.860 (0.849, 0.870)	0.880 (0.850, 0.907)	6559
	rSNF	0.180 (0.157, 0.202)	0.263 (0.175, 0.366)	0.685 (0.679, 0.691)	0.717 (0.692, 0.743)	0.876 (0.867, 0.886)	0.873 (0.847, 0.903)	656
	rSNFi	0.274 (0.245, 0.301)	0.176 (0.068, 0.278)	0.726 (0.718, 0.735)	0.673 (0.639, 0.706)	0.870 (0.858, 0.882)	0.835 (0.785, 0.880)	91.5

CI: 95% bootstrap confidence interval; {MCC,PREC,REC}_cv: best average MCC, precision, recall in cross-validation on training set splits; {MCC,PREC,REC}_ts: average MCC, precision, recall on test set splits; Nf: median number of features leading to MCC_cv. Bold indicates best performance (highest MCC and smallest signature size). Precision and recall were computed for binary classification tasks only.
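As a worked check on the Nf column, the feature-set sizes explored by the DAP (5, 10, 25, 50, 75, and 100% of the features, see section 2.4) can be computed from the layer sizes in Table 1. The sketch below assumes that fractional sizes are rounded up; this rounding rule is an inference from the reported numbers, not a documented detail, and the helper name `feature_steps` is hypothetical.

```python
PERCENTAGES = (5, 10, 25, 50, 75, 100)


def feature_steps(n_features, percentages=PERCENTAGES):
    """Feature-set sizes for each percentage step, rounded up (assumed rule)."""
    # -(-a // b) is integer ceiling division, avoiding floating-point error.
    return [-(-n_features * p // 100) for p in percentages]


# Total number of juxtaposed features per dataset, summed from Table 1.
brca = 17814 + 18050 + 142   # gene + cnv + prot
aml = 10265 + 2500 + 352     # gene + meth + mirna
kirc = 10265 + 2500 + 484    # gene + meth + mirna

print(feature_steps(brca))
print(feature_steps(aml))
print(feature_steps(kirc))
```

Under this assumption, 1801 corresponds to the 5% step for BRCA, 656 and 6559 to the 5% and 50% steps for AML, and 3313 to the 25% step for KIRC. The value 2319 for KIRC-OS juXT equals the midpoint of the 10% and 25% steps (1325 and 3313), which is what a median over an even number of splits can return.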

To ensure that the predictive profiling procedure is not affected by selection bias, the whole INF workflow, including the rSNF procedure, is also repeated after randomly scrambling the training set labels (“random labels” mode): in this setup, the performance of a classifier unaffected by systematic bias should be close to that of a random predictor, with MCC close to zero.
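The logic of the “random labels” check can be reproduced on toy data with a few lines of scikit-learn. The sketch below is only an illustration under stated assumptions: the data are synthetic, a single 5-fold cross-validation replaces the full DAP, and the forest uses 100 trees instead of 500 to keep the run short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Toy data with a clear class signal (not one of the TCGA tasks).
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           class_sep=2.0, random_state=0)
y_shuffled = np.random.default_rng(0).permutation(y)  # scrambled labels


def cv_mcc(X, y):
    """Mean MCC of a Random Forest over a stratified 5-fold cross-validation."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(clf, X, y, cv=cv,
                             scoring=make_scorer(matthews_corrcoef))
    return scores.mean()


mcc_true = cv_mcc(X, y)             # expected to be clearly positive
mcc_random = cv_mcc(X, y_shuffled)  # expected to be near zero
print(round(mcc_true, 3), round(mcc_random, 3))
```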

2.5. Implementation

The complete INF pipeline is implemented through the workflow management tool Snakemake (27, 28), which allows automatic handling of all dependencies required to generate the INF output. The pipeline operates on N omics input files, one for each layer that should be integrated, and a single file describing the patient labels. The omics files are tab-separated text matrices with patients on the rows and features on the columns, with row and column identifiers. The label file is a single column file with patient phenotypes, with no header. This input structure, with one file per omic layer and a label file, simplifies the downstream analysis and reduces to a minimum the preprocessing burden for the end user. The predictive profiling module, including the DAP, is written in Python 3.6 on top of NumPy (29) and scikit-learn methods (30). The ranked SNF (rSNF) procedure is implemented in R (31) leveraging the original R scripts provided by SNF authors (5), extended by a dedicated script for SNF tuning and a main script for SNF analysis and the post-SNF feature selection procedure, which is parallelized over the features for efficiency using the foreach R library.
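The input layout described above can be read in Python as in the following sketch. The file contents, layer names, and the use of pandas are illustrative assumptions; in particular, the example assumes that the label file lists patients in the same row order as the omics matrices, which the text does not state explicitly.

```python
import io

import pandas as pd

# Hypothetical tab-separated omics matrices: patients on rows, features on columns.
gene_txt = "sample\tGENE_A\tGENE_B\nP1\t1.2\t0.4\nP2\t0.7\t2.1\nP3\t1.9\t1.1\n"
prot_txt = "sample\tPROT_X\nP1\t0.3\nP2\t0.8\nP3\t0.5\n"
# Hypothetical label file: one phenotype per line, no header.
labels_txt = "0\n1\n1\n"

layers = {
    name: pd.read_csv(io.StringIO(txt), sep="\t", index_col=0)
    for name, txt in [("gene", gene_txt), ("prot", prot_txt)]
}
labels = pd.read_csv(io.StringIO(labels_txt), header=None)[0]

# Sanity check: all layers describe the same patients in the same order.
first_index = layers["gene"].index
aligned = all(df.index.equals(first_index) for df in layers.values())

# Juxtaposition: column-wise concatenation, keeping the layer name per feature.
juxt = pd.concat(layers, axis=1)
print(juxt.shape, aligned, labels.tolist())
```

With real data, `io.StringIO(...)` would be replaced by the path of each layer file.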

2.6. Computational Details

The INF computations were run on the FBK Linux high-performance computing facility KORE, on an 8-core i7 3.4 GHz Linux workstation, and on a 72-vCPU 2.7 GHz Platinum Intel Xeon 8168 Microsoft Azure cloud machine (F72s v2 series).

2.7. Data and Code Availability

To further foster reproducibility and support users and future developers, the full code of this benchmark is publicly shared on the GitLab repository https://gitlab.fbk.eu/MPBA/INF. Additional information is included in the Supplementary Material available on the publisher's website, while the full set of experimental data can be accessed at http://dx.doi.org/10.6084/m9.figshare.12052995.v1.

3. Results

The INF workflow was run on all tasks considering 3-layer integration and all 2-layer combinations; the DAP was also run separately on all single-layer datasets in order to obtain a baseline. All results presented here refer to experiments performed with RF classifier. Experiments using LSVM were performed on BRCA-ER and KIRC-OS obtaining similar classification performances, top features and layer contributions (Supplementary Tables BRCA-ER_LSVM, KIRC-OS_LSVM). The classifier performance for 3-layer integration is summarized in Table 4, in terms of average cross-validation MCC on the 10 training set splits (MCC_cv) with 95% Studentized bootstrap confidence intervals (CI) as (MCC_min, MCC_max), average MCC on the 10 test set splits (MCC_ts) with CI, and median number of features (Nf) yielding MCC_cv. Similarly, precision (PREC) and recall (REC) are reported in Table 4 as average cross-validation and test set values with CI. As expected, whenever there is a non-negligible unbalance toward the positive class, the number of false positives tends to increase, with more false positives yielding a comparatively low precision with higher recall, and vice versa. In both cases, the MCC efficiently works in balancing the two effects. The classifier performance on single-layer and 2-layer data is summarized in Figure 4.
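For readers unfamiliar with Studentized (bootstrap-t) confidence intervals, the sketch below shows one common way to compute them for the mean of a small set of scores. It is a generic textbook construction, not necessarily the exact procedure used for Table 4; the function name, the number of resamples, and the ten MCC values are made up for illustration.

```python
import numpy as np


def studentized_bootstrap_ci(values, n_boot=2000, level=0.95, seed=0):
    """Bootstrap-t confidence interval for the mean of `values`."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    rng = np.random.default_rng(seed)
    mean = x.mean()
    se = x.std(ddof=1) / np.sqrt(n)
    t_stats = []
    for _ in range(n_boot):
        sample = rng.choice(x, size=n, replace=True)
        se_b = sample.std(ddof=1) / np.sqrt(n)
        if se_b > 0:  # skip degenerate resamples with zero spread
            t_stats.append((sample.mean() - mean) / se_b)
    lo_q, hi_q = np.quantile(t_stats, [(1 - level) / 2, (1 + level) / 2])
    # The upper t quantile sets the lower bound, and vice versa.
    return mean - hi_q * se, mean - lo_q * se


# Hypothetical test-set MCC values over 10 splits (illustrative only).
mcc_values = [0.78, 0.81, 0.83, 0.80, 0.85, 0.79, 0.84, 0.82, 0.86, 0.80]
lower, upper = studentized_bootstrap_ci(mcc_values)
print(round(lower, 3), round(np.mean(mcc_values), 3), round(upper, 3))
```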

Figure 4

Overview of Random Forest classification performance (MCC, Matthews Correlation Coefficient) on the four tasks in cross validation (“CV”) and test (“ts”), on single-layer (blue shades) and on all two-layer combinations for juXT (orange), rSNF (red) and rSNFi (green). Bars indicate 95% confidence intervals. On top of each CV-ts pair is the median number of features leading to best CV performance.

A comparison between the “accelerated” flavor of the DAP (aDAP) and the full DAP (fDAP) was run on synthetic data, BRCA-ER and BRCA-subtypes data, with aDAP yielding similar performance metrics and top-ranked biomarker lists as fDAP (Supplementary Tables Synthetic_RF, BRCA_RF_fDAP, canberra_distances), while being ≈30× faster (for BRCA-ER, approx. 2 vs. 64 h, or 300 features/min vs. 9 features/min). All the results presented here were thus obtained using aDAP. Moreover, the INF workflow running in “random labels” mode achieved an average cross-validation MCC ≈0, as expected from a procedure unaffected by systematic bias. Overall, integrating multiple omics layers with INF yields classification performance better than or comparable to that obtained using only features from a single layer or naïve omics juxtaposition, with much more compact signature sizes. On 3-layer BRCA-subtypes and 2- or 3-layer KIRC-OS, INF outperforms the single layers, as well as juXT and rSNF (Figure 4, Table 4). On 2-layer BRCA-subtypes, INF performance on gene-cnv and gene-prot is comparable to the best-performing single-layer data (gene) and superior to cnv and prot single layers, while INF on cnv-prot only improves over the cnv single layer. On the BRCA-ER task, the performance with INF integration of 2 or 3 layers is still better than using single layers, although to a smaller extent, except for cnv-prot integration, which performs better than cnv alone but slightly worse than gene and prot single layers. The good performances achieved at the gene and prot single layers are not unexpected, since the biological nature of the target ER-status is defined at transcriptomics level. On the more difficult AML-OS task, INF has better performance over both rSNF and juXT on gene-mirna and meth-mirna integration, still improving over single-layer performance both in terms of MCC and reduced signature sizes.

3.1. One or Multi-Omics Layers vs. juXT/rSNF/rSNFi

For BRCA-ER, three-layer INF (rSNFi) integration performs better than either rSNF or juXT (MCC test 0.830 vs. 0.804, 0.797 for rSNF and juXT, respectively). All two-layer INF integrations perform similarly to, or better than, the corresponding rSNF and juXT integrations, in particular for cnv-prot integration (MCC test 0.746 vs. 0.682, 0.692 resp. for rSNF and juXT). On BRCA-subtypes, the 3-layer INF integration performs better than either rSNF or juXT (MCC test 0.838 vs. 0.811, 0.795 resp. for rSNF and juXT), nevertheless without improving over the gene single-layer performance (MCC test 0.821). However, the INF median signature size is only 301.5, compared to 1801 for rSNF and juXT, and 891 for the gene layer alone. All two-layer INF integrations yield better performance than their corresponding juXT or rSNF integrations. Omics integration is particularly effective for KIRC-OS, as all 2- and 3-layer INF integrations outperform juXT, rSNF, and each of the single-layer classifiers. In fact, 3-layer rSNFi achieves MCC test 0.378 vs. 0.274, 0.305 (resp. for rSNF, juXT), 0.296, 0.327, 0.333 (resp. rSNFi meth-mirna, gene-mirna, gene-meth), and 0.253, 0.261, 0.249 (resp. gene, meth, mirna). For AML-OS, INF feature sets are always more compact than either juXT or rSNF, with three-layer integration giving better MCC than any of the INF two-layer integrations (MCC test 0.176 vs. 0.125, 0.169, 0.047, respectively three-layer vs. meth-mirna, gene-mirna, gene-meth). Moreover, cross-validation MCCs corresponding to INF integration are better than any single layer MCC as well as rSNF and juXT.

3.2. Characterization of the Signatures Identified by INF

For all tasks, INF signatures are markedly more compact than both the juXT and rSNF signatures. The largest reduction in size occurs for the AML-OS 3-layer integration, with 91.5 vs. 6559 median features (rSNFi vs. juXT; 1.4%), while the smallest reduction is observed for the BRCA-subtypes task, with 301.5 vs. 1801 median features (rSNFi vs. juXT; 16.7%). In terms of contributions from the omics datasets being integrated, the gene layer generally provides the largest number of features to the signatures identified by the INF workflow. In particular, for the BRCA dataset, in both the ER and subtypes tasks, the gene layer contributes over 95% of the top features for juXT and rSNFi, with rSNF signatures being slightly more balanced (the prot contribution remains marginal, while cnv provides 28.3% and 17.7% of the top features in the ER and subtypes tasks, respectively). This is expected, as the class label is defined mainly at the transcriptomics level. In the AML-OS experiments, the layer contributing the most is still gene, accounting for ca. 78%, 73%, and 81% of the top feature sets for the RF juXT, rSNF and rSNFi experiments, respectively. In the KIRC-OS experiments, gene is the layer contributing the most to the top juXT and rSNF feature sets, while meth is the major contributor for rSNFi. The percentages of features from each omics layer contributing to the top signatures for the juXT, rSNF and rSNFi 3-layer integrations are reported in Supplementary Tables layer_contribution. The RF rSNFi signatures for all tasks are available in Supplementary Tables BRCA-ER_RF_rSNFi, BRCA-subtypes_RF_rSNFi, AML-OS_RF_rSNFi and KIRC-OS_RF_rSNFi. Even though a systematic biological interpretation of the identified signatures is beyond the scope of this work, we compared them with published data to ascertain the reliability of our results. The top features in the BRCA-ER rSNFi signature include multiple genes known to be associated with breast carcinoma progression and outcome, such as AGR3, B3GNT, and MLPH (32–34).
In addition, we find the estrogen receptor gene (ESR1 from the gene layer and ER-alpha from the prot layer) and the transcription factor GATA3 (from both the gene and prot layers) (35). Both the BRCA-ER and BRCA-subtypes signatures include genes previously identified as novel biomarkers for intrinsic breast carcinoma subtype prediction (36). Interestingly, there is only partial overlap between the top features identified in the BRCA ER and subtypes tasks. For the AML-OS task, it is noteworthy that the top feature identified has recently been reported as a potential biomarker predicting overall survival in a subset of AML patients (37). Within the mirna features of the AML-OS signature, MIR-203 expression was recently found to be associated with AML patient survival (38); MIR-100 is highly expressed in AML and was found to regulate cell differentiation and survival (39); and high expression of miR-504-3p was reported to be associated with favorable AML prognosis (40). Given that the rSNFi signature identified in the KIRC-OS task contains a large percentage of methylation features (86.5%), its direct interpretation is more difficult. It is, however, interesting to observe that all 15 gene features in the signature are identified as prognostic markers for renal carcinoma according to the Human Protein Atlas (41).
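The size ratios and per-layer contributions quoted above are simple proportions. The following minimal sketch shows how they could be computed; the `layer:name` feature naming and the toy signature are assumptions made here for illustration and do not reproduce the format of the supplementary tables.

```python
from collections import Counter


def size_ratio(signature_size, baseline_size):
    """Signature size expressed as a percentage of the baseline feature set size."""
    return 100.0 * signature_size / baseline_size


def layer_contribution(features):
    """Percentage of features contributed by each omics layer.

    Assumption: each feature name carries its layer as a prefix, e.g. "gene:ESR1".
    """
    counts = Counter(name.split(":", 1)[0] for name in features)
    total = sum(counts.values())
    return {layer: 100.0 * n / total for layer, n in counts.items()}


# Median signature sizes reported in the text (rSNFi vs. juXT).
aml_ratio = size_ratio(91.5, 6559)
brca_subtypes_ratio = size_ratio(301.5, 1801)

# Toy signature, invented for illustration only.
toy_signature = ["gene:ESR1", "gene:GATA3", "prot:ER-alpha", "gene:AGR3"]
toy_contribution = layer_contribution(toy_signature)

print(round(aml_ratio, 1), round(brca_subtypes_ratio, 1), toy_contribution)
```

With the reported median sizes, the two ratios round to the 1.4% and 16.7% given in the text.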

3.3. Unsupervised Analysis

The features selected by juXT, rSNF and rSNFi are projected onto a two-dimensional space using the UMAP unsupervised multidimensional projection method (42, 43). Here we show an example on the BRCA-subtypes 3-layer dataset, with a UMAP projection of the features selected by juXT (Figure 5) compared to the UMAP projection of the INF signature (Figure 6) for one of the 10 data splits (the UMAP plots for the remaining 9 splits are in Figures S1, S2). Colors represent cancer subtypes and shapes represent training/test partitions. Using the 1801 juXT features, cancer subtypes are roughly clustered, with HER2-enriched and Luminal B being more dispersed (Figure 5). The clusters appear to be more sharply defined in the projection of the 302-feature INF signature: in particular, Basal-like patients form a distinct cluster, while the Luminal A, Luminal B and HER2-enriched patient clusters are close to each other, slightly overlapping yet hinting at a trajectory pattern (Figure 6). The HER2/luminal cluster contains two patients classified as Basal-like subtype, consistent with the findings of (44).

Figure 5

UMAP projection on the BRCA-subtypes task with 3-layer juxtaposed data. Circle, TR set; triangle, TS set; diamond, TS2 set.

Figure 6

UMAP projection on the BRCA-subtypes task with 3-layer juxtaposed data restricted to the rSNFi signature. Circle, TR set; triangle, TS set; diamond, TS2 set.

4. Discussion

4.1. Background and Related Work

Ritchie et al. (45) defined omics data integration as the combination of multiple omics datasets that can be used for the development of models to predict complex traits or phenotypes. The problem of data integration in computational biology is far from having a consolidated and shared solution. Many long-standing obstacles are still far from being overcome, and the increasing availability of data [e.g., TCGA, (46)] and computational tools [see for instance (47–51) and https://github.com/mikelove/awesome-multi-omics], some of them interactive [e.g., (52)], is raising new issues that need to be addressed. In fact, not only do existing datasets still lack standardization protocols to deal with their complexity and heterogeneity, but the reliability, reproducibility and interpretability of new computational methods are also emerging as urgent and relevant questions (53). Moreover, modern technologies allow the rapid extraction of high-dimensional, high-throughput features from different sources (e.g., gene expression, DNA sequencing, metabolomics, or high-resolution images), which in turn requires collaboration between biologists, computer scientists, physicians and other experts. The lack of common methodologies and terminologies can transform this synergy into a further level of complexity in the process of data integration (54). As observed in (55, 56), specific technological limits, noise levels and variability ranges affect the different omics layers and confound the underlying biological signals. As a result, truly integrative analysis is still very rare, and different methods often discover different kinds of patterns, as evidenced by the lack of consistency in the published results, although efforts in this direction have started to appear (57, 58).
Indeed, the underlying hypothesis of multi-omics integration is that different omics data can provide complementary information (56) [although sometimes redundant (9)], and thus a broader insight with respect to single-layer analysis, for a better understanding of disease mechanisms (59). This assumption has been confirmed by multiple studies on diverse diseases, such as cardiovascular disease (60), diabetes (61), liver disease (62), or mitochondrial diseases (63), and also longitudinally (64), suggesting that the more complex the disease the more advantageous the integration. As the co-occurrence of multiple causes and correlated events is a well-known characteristic of tumorigenesis and cancer development, the integration of data generated from multiple sources can thus be particularly useful for the identification of cancer hallmarks (65–68). Many computational strategies have been introduced that combine multiple types of data to identify novel biomarkers and thus to predict a phenotype of interest or drive the development of intervention protocols. Given the heterogeneity of data and tasks, these techniques deal with the data integration at different levels of the learning process: (i) by concatenating the features before fitting a model (early-integration), (ii) by incorporating the integration step into the model training (intermediate-integration), or (iii) by combining the outputs of distinct models for the final prediction (late-integration) (69, 70). In the early-integration approach, also known as juxtaposition-based, the multi-omics datasets are first concatenated into one matrix. To deal with the high-dimensionality of the joint dataset, these methods generally adopt matrix factorization (55, 56, 58, 71), statistical (47, 49, 58, 60, 62, 72–76), and machine learning tools (58, 76, 77). Alternatively, data models relying on polyglot approaches can be used especially in (bio)informatics applications (78, 79). 
Although the dimensionality reduction procedure is necessary and may improve the predictive performance, it can also cause the loss of key information (69). Moreover, biomarkers identified purely on a computational statistics rationale from meta-omics features often lack biological plausibility (80). In order to maximize the contribution of each single-omics layer, the late-integration methods first model each dataset individually, and then merge or average the results; they are also known as model-driven (70, 81). Although these techniques avoid the pre-selection of the features, they do not leverage the hidden correlations between the data, posing again the risk of signal loss (80, 82). The intermediate-integration strategies aim at developing a joint model that accounts for the correlation between the omics layers, to boost their combined predictive power (83). Among these methods, network-based models rely on the reconstruction of a graph representing the complex biological interactions (76, 84), known or predicted, between the variables to discover novel informative relationships (85). They have successfully been applied in cancer research for the identification of pan-cancer drug targets (86), the detection of subtype-specific pathways (83, 87) and of genetic aberrations (88), or the stratification of cancer patients (89–91). In particular, Koh et al. (44) predicted breast cancer subtypes by applying a modified shrunken centroid method in the development of their network-based tool, iOmicsPASS. Further, the breast cancer datasets in TCGA represent a benchmark for integrative models (92–94), as do the AML datasets (95). More recently, the success of deep learning algorithms in various bioinformatics fields (96) prompted the adoption of deep neural networks for omics-integration in precision oncology.
Autoencoders and convolutional neural networks have been effectively trained for the prediction of prognostic outcomes (9, 97), response to chemotherapeutic drugs (50), and gene targeting (98), by adopting either an early-integration (9, 98) or a late-integration (50, 97) strategy. Although deep learning models hold the potential to include image-derived features in the integration workflow, they suffer from interpretability and generalization issues (99). Although it is clear that no single method is consistently preferable, and that most of the proposed approaches are task and/or data dependent (80), the complexity of tumor analysis suggests that network-based approaches are needed (87, 100). In this context, it is clear that omics-integration is one of the most promising and demanding challenges of modern bioinformatics, and that there is an urgent need to prove the reproducibility, interpretability, and generalization capability of the proposed methods (85, 101).
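To make the distinction between early- and late-integration concrete, the sketch below contrasts the two on toy data. It is an illustration only: the nearest-centroid classifier, the majority vote and all data values are assumptions chosen for brevity, and none of them is taken from the methods cited above.

```python
def centroid_fit(X, y):
    """Fit a nearest-centroid model: one mean vector per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}


def centroid_predict(model, row):
    """Return the class whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(row, model[label]))
    return min(sorted(model), key=sq_dist)


def juxtapose(layers):
    """Concatenate the feature vectors of each sample across layers."""
    return [[v for layer_row in rows for v in layer_row] for rows in zip(*layers)]


def early_integration(train_layers, y, test_layers):
    """Early integration: juxtapose all layers, then fit a single model."""
    model = centroid_fit(juxtapose(train_layers), y)
    return [centroid_predict(model, r) for r in juxtapose(test_layers)]


def late_integration(train_layers, y, test_layers):
    """Late integration: one model per layer, predictions merged by majority vote."""
    per_layer = []
    for X_train, X_test in zip(train_layers, test_layers):
        model = centroid_fit(X_train, y)
        per_layer.append([centroid_predict(model, r) for r in X_test])
    votes = list(zip(*per_layer))
    return [max(sorted(set(v)), key=v.count) for v in votes]


# Toy data (invented): three layers, four training samples, two test samples.
y_train = [0, 0, 1, 1]
train_layers = [
    [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]],  # "gene"
    [[0.1], [0.0], [1.0], [0.8]],                      # "prot"
    [[0.0], [0.1], [0.9], [1.0]],                      # "cnv"
]
test_layers = [[[0.1, 0.1], [1.0, 1.0]], [[0.0], [0.9]], [[0.2], [0.8]]]

early_pred = early_integration(train_layers, y_train, test_layers)
late_pred = late_integration(train_layers, y_train, test_layers)
print(early_pred, late_pred)
```

Intermediate integration has no equally short analogue, because the fusion happens inside the model itself rather than before or after it.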

4.2. Integrative Network Fusion

We present the INF framework for the characterization of cancer patient phenotypes by integrated multi-omics signatures, combining an improved version of a state-of-the-art integration technique (5) with predictive models developed inside a Data Analysis Plan (6) for machine learning. The framework is applied to TCGA data to predict clinically relevant patient phenotypes such as overall survival or cancer subtypes. The simplest approach to multi-omics data integration consists of the juxtaposition of normalized measurements into one joint matrix, followed by the development of a predictive model. Juxtaposition-based integration is considered a baseline technique, since it is the most naïve approach to combining datasets; moreover, it enables the identification of multi-omics signatures by borrowing discriminatory strength from the information derived from all datasets. However, juxtaposition further dilutes the possibly already low signal-to-noise ratio in each data type, affecting the understanding of the biological interactions at the different omics levels. Conversely, the INF method for omics data integration is an improvement of the popular Similarity Network Fusion (SNF) approach (5), which has inspired several studies in the scientific literature, specifically in cancer genomics (77, 87, 102–106). SNF maximizes the shared or correlated information between multiple datasets by combining data through inference of a joint network-based model, accounting for how informative each data type is to the observed similarity between samples. Two innovative solutions have been implemented in this study: (i) we devised an SNF-based procedure to rank variables according to their importance in clustering samples with similar phenotypes; and (ii) predictive models were developed exploiting the SNF-ranked variables, inside a rigorous Data Analysis Plan which ensures reproducibility (6, 19).
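The network fusion idea can be illustrated with a deliberately simplified sketch of SNF-style cross-diffusion. This is not the R implementation used in this work: the fixed Gaussian bandwidth `sigma`, the exclusion of each sample from its own neighbourhood, the plain row normalization and the toy data are all simplifying assumptions.

```python
import math


def affinity(X, sigma=0.5):
    """Gaussian sample-by-sample similarity from a samples x features matrix."""
    n = len(X)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            W[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return W


def row_normalize(W):
    """Scale every row so that it sums to one."""
    return [[w / sum(row) for w in row] for row in W]


def local_kernel(W, k):
    """Keep each sample's k most similar neighbours (itself excluded), row-normalized."""
    n = len(W)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: W[i][j], reverse=True)
        nbrs = order[:k]
        total = sum(W[i][j] for j in nbrs)
        for j in nbrs:
            S[i][j] = W[i][j] / total
    return S


def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def snf(layers, k=2, iterations=10, sigma=0.5):
    """Fuse per-layer sample networks by iterative cross-diffusion."""
    W = [affinity(X, sigma) for X in layers]
    P = [row_normalize(w) for w in W]    # global similarity per layer
    S = [local_kernel(w, k) for w in W]  # sparse local similarity per layer
    n, m = len(P[0]), len(P)
    for _ in range(iterations):
        updated = []
        for v in range(m):
            others = [P[u] for u in range(m) if u != v]
            avg = [[sum(o[i][j] for o in others) / len(others) for j in range(n)]
                   for i in range(n)]
            S_t = [list(c) for c in zip(*S[v])]
            # Diffuse the other layers' network through this layer's local structure.
            updated.append(row_normalize(matmul(matmul(S[v], avg), S_t)))
        P = updated
    return [[sum(p[i][j] for p in P) / m for j in range(n)] for i in range(n)]


# Toy data (invented): six samples in two groups, measured on two layers.
layer_a = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
layer_b = [[0.0, 0.2], [0.1, 0.0], [0.2, 0.1], [0.9, 1.0], [1.0, 0.8], [1.1, 1.0]]
fused = snf([layer_a, layer_b])
print([round(v, 3) for v in fused[0]])
```

In the fused network, the similarity between samples of the same toy group remains higher than the similarity across groups, which is the property that a ranking of features by their contribution to sample clustering would build on.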
The performance of INF was assessed in terms of both statistical properties and biological interest. Concerning the statistical aspect, INF was compared with predictive models developed on the juxtaposed datasets (juXT technique), as well as on the single-layer datasets. With INF, smaller signatures were systematically derived that achieve comparable or even better performance both in cross-validation and in test. This is an added value for INF, as the biological validation of biomarkers can definitely benefit from small signatures in terms of both costs and required time. This achievement is mainly due to the novel rSNF ranking, which increases the signal-to-noise ratio from the combined layers by prioritizing the most discriminant biomarkers in terms of network mutual information. rSNF exploits two main SNF advantages: integration of heterogeneous data and clustering of sample networks. The main peculiarity of the SNF integrative procedure is its robustness to noise (5), because weak similarities among samples (low-weight edges) disappear, except for low-weight edges supported by all networks, which are conserved depending on how tightly connected their neighborhoods are across networks. Moreover, the rSNFi step further increases the signal-to-noise ratio by training a predictive classifier on multi-omics juxtaposed data restricted to the top-ranked biomarkers shared by the juXT and rSNF models. The resulting signatures are compact in size (up to 99% reduction w.r.t. juXT) while allowing predictive models to achieve equal or better performance compared to naïve juxtaposition or the single layers alone. While a comprehensive evaluation of the biological meaning of the signatures identified through the INF framework is beyond the scope of this work, we assessed their general validity with a thorough literature search.
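The rSNFi restriction step described above can be sketched as a set intersection followed by column selection. The single `top_k` cut-off applied to both rankings, the function names and the toy feature names are assumptions made for illustration; they are not taken from the INF code.

```python
def rsnfi_features(juxt_ranked, rsnf_ranked, top_k):
    """Features found in the top_k of both rankings, kept in juXT rank order."""
    rsnf_top = set(rsnf_ranked[:top_k])
    return [f for f in juxt_ranked[:top_k] if f in rsnf_top]


def restrict(X, feature_names, keep):
    """Select from the juxtaposed matrix X only the columns named in keep."""
    idx = [feature_names.index(f) for f in keep]
    return [[row[j] for j in idx] for row in X]


# Toy rankings (invented), best feature first.
juxt_ranked = ["gene:ESR1", "gene:GATA3", "cnv:chr1", "prot:ER-alpha", "gene:AGR3"]
rsnf_ranked = ["gene:GATA3", "prot:ER-alpha", "gene:ESR1", "gene:MLPH", "cnv:chr1"]

shared = rsnfi_features(juxt_ranked, rsnf_ranked, top_k=4)

# Toy juxtaposed matrix: two samples, columns in the order of feature_names.
feature_names = ["gene:ESR1", "gene:GATA3", "gene:AGR3", "prot:ER-alpha", "cnv:chr1"]
X = [[1.0, 2.0, 3.0, 4.0, 5.0],
     [6.0, 7.0, 8.0, 9.0, 10.0]]
X_rsnfi = restrict(X, feature_names, shared)
print(shared, X_rsnfi)
```

A classifier trained on `X_rsnfi` would then play the role of the compact rSNFi model.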
Our investigation shows that the signatures identified through the INF framework include biological markers that are relevant to the tasks under analysis and are consistent with previously published data. Further, as in (9), the largest contribution to the biomarker lists is provided by gene expression, while epigenomics, proteomics and miRNA transcriptomics play a minor role. It should be noted that, especially in computational biology, multicollinearity between pairs of predictors and/or layers is intrinsic to the problem. Nevertheless, most machine learning models are designed to identify the relevant predictors even in the presence of strong linear or non-linear correlations, provided that an appropriate DAP, feature ranking method, and diagnostic tools (e.g., random labels) are adopted against selection bias. To this aim, the application of a DAP derived from the MAQC-II initiative for model selection is a core attribute of the INF framework. A fair comparison of INF results with other integration methods is currently unfeasible due to the number and variety of computational pipelines with dissimilar datasets, preprocessing methods, data analysis plans, and performance metrics. This work is based on the original R implementation of the SNF algorithm (5). However, we are aware that Open Source implementations exist in other programming languages, in particular snfpy for Python (107). In a future release of the INF workflow, we plan to migrate the SNF-related parts to snfpy or a similar Python-based implementation, in order to drop the dependency on R and to potentially improve the overall performance. In its current version, the INF framework supports the integration of two or more one-dimensional omics layers.
As part of our future effort we will add support for the integration of medical imaging layers, for example leveraging the extraction of histopathological features from whole slide images by deep learning (10) or using radiomics or deep features from radiological images (11). In both cases, further issues will emerge from the interactions between the omics and the non-omics data, needing particular care in the integration (12).
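As a complement to the discussion of the DAP, the following minimal sketch shows the three ingredients mentioned in this work: a repeated stratified cross-validation resampling (10x5-fold), the MCC metric, and the random-labels diagnostic. It is a standard-library illustration written under our own assumptions (binary 0/1 labels, round-robin fold assignment, fixed seeds) and is not the DAP code used for the experiments.

```python
import math
import random


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def repeated_stratified_folds(y, n_repeats=10, n_folds=5, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for n_repeats x n_folds CV."""
    rng = random.Random(seed)
    for r in range(n_repeats):
        folds = [[] for _ in range(n_folds)]
        for label in sorted(set(y)):
            idx = [i for i, v in enumerate(y) if v == label]
            rng.shuffle(idx)
            # Deal samples of each class round-robin to keep folds stratified.
            for pos, i in enumerate(idx):
                folds[pos % n_folds].append(i)
        for f in range(n_folds):
            test = sorted(folds[f])
            train = sorted(i for g in range(n_folds) if g != f for i in folds[g])
            yield r, f, train, test


def shuffled_labels(y, seed=0):
    """Random-labels diagnostic: permute labels, keeping class proportions."""
    permuted = list(y)
    random.Random(seed).shuffle(permuted)
    return permuted


# Toy labels (invented): 10 samples per class.
y = [0] * 10 + [1] * 10
splits = list(repeated_stratified_folds(y))
y_random = shuffled_labels(y)
print(len(splits), mcc([0, 0, 1, 1], [0, 0, 1, 1]))
```

A model trained on `y_random` inside the same resampling scheme is expected to yield an MCC close to zero; a clearly positive value would point to selection bias or information leakage.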

Data Availability Statement

The original contributions presented in the study are included in the article/supplementary files; further inquiries can be directed to the corresponding author/s.

Author Contributions

CA, LT, and GJ: conceptualization. MC, NB, AM, AZ, LT, CA, and GJ: methodology. MF: interpretation. GJ: coordination. MC, NB, AM, MF, GJ, and CF: writing. All authors contributed to the article and approved the submitted version.

Conflict of Interest

AZ was employed by the company NIDEK Technologies Srl. CF was employed by the company HK3 Lab. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
