EMT network-based feature selection improves prognosis prediction in lung adenocarcinoma.

Borong Shao^1,2, Maria Moksnes Bjaanæs^3,4,5, Åslaug Helland^3,4,5, Christof Schütte^1,2, Tim Conrad^1,2.

Abstract

Various feature selection algorithms have been proposed to identify cancer prognostic biomarkers. In recent years, however, their reproducibility has been criticized. The performance of feature selection algorithms has been shown to depend on the datasets, underlying networks, and evaluation metrics. One of the causes is the curse of dimensionality, which makes it hard to select features that generalize well to independent data. Even the integration of biological networks does not mitigate this issue, because the networks are large and many of their components are not relevant to the phenotype of interest. With the availability of multi-omics data, integrative approaches are being developed to build more robust predictive models. In this scenario, the higher data dimensionality creates greater challenges. We proposed a phenotype relevant network-based feature selection (PRNFS) framework and demonstrated its advantages in lung cancer prognosis prediction. We constructed cancer prognosis relevant networks based on epithelial mesenchymal transition (EMT) and integrated them with different types of omics data for feature selection. With less than 2.5% of the total dimensionality, we obtained EMT prognostic signatures that achieved remarkable prediction performance (average AUC values >0.8), very significant sample stratifications, and meaningful biological interpretations. In addition to finding EMT signatures from different omics data levels, we combined these single-omics signatures into multi-omics signatures, which improved sample stratifications significantly. Both single- and multi-omics EMT signatures were tested on independent multi-omics lung cancer datasets, and significant sample stratifications were obtained.

Substances: Biomarkers, Tumor

Year: 2019 PMID: 30703089 PMCID: PMC6354965 DOI: 10.1371/journal.pone.0204186

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

Prognosis prediction is necessary for clinical decision making in cancer. Traditionally, cancer prognosis prediction is based on clinical variables such as tumor stage, age, and disease history, where the information of a patient is compared against population cancer registries [1]. However, these clinical parameters are insufficient to accurately predict patient risk [2], as histologically similar tumors can be completely different diseases at the molecular level [3, 4]. Therefore, molecular signatures are needed to give more accurate prognosis predictions. Nowadays we can obtain tumor molecular profiles in greater detail. As one of the large-scale projects, The Cancer Genome Atlas (TCGA) provides access to genomic, transcriptomic, epigenomic, and proteomic data from more than 11,000 cases in 33 cancer types and subtypes [5]. Using these data, researchers aim to build better prognosis prediction models. This has proven very challenging due to the high dimensionality of omics data, where the number of features far exceeds the number of samples. This is often referred to as the curse of dimensionality. When the data lie in high dimensions, the samples become very sparse. This can cause a lack of statistical significance and over-fitting of machine learning models. Fortunately, not all features are relevant for predicting the phenotype of interest. It is desirable to find molecular signatures that capture the footprint of the phenotype so that the signatures can be applied to unseen samples. Various feature selection methods have been proposed to find molecular signatures. Early reviews grouped them into three categories: filter, wrapper, and embedded methods [6, 7]. Although many important algorithms were introduced, network-based feature selection algorithms were not included. After reviewing the literature, we found three main categories of network-based feature selection algorithms. The first category involves network-guided search. 
An algorithm identifies subnetworks that can best differentiate different phenotype groups. Each subnetwork is aggregated to produce one feature (called metagene) and eventually the metagenes are used as features for training predictive models. Different scoring functions were proposed to rank the subnetworks. For example, [8] used mutual information as the scoring function and the addition operator to aggregate subnetworks. [9, 10] used the p-value of Cox PH model in defining the scoring functions. [11] dichotomized features and defined a scoring function based on information theory. [12] tested the effects of different aggregation operators on the prediction performance.

The second category of methods uses network-based regularization. Regularization methods such as Lasso have been widely applied for feature selection. To integrate network information, the penalty term takes into account the network connectivity. Adjacency matrix A and Laplacian matrix L are frequently used to represent a network G to be included in the penalty term. The majority of methods in this category are based on linear classifiers and can be written in the following form: $\hat{w} = \arg\min_{w} \; \ell(w; X, y) + \mathrm{penalty}(w, G)$, where $\ell$ is the loss of the linear model with weights $w$ on the data $(X, y)$. For example, in graph Lasso $\mathrm{penalty}(w, G) = \lambda \|w\|_1 + (1 - \lambda) \sum_{i,j} A_{ij} (w_i - w_j)^2$. This forces adjacent nodes to have similar weights [13, 14]. Using similar formulations, [15] proposed a network-constrained regularization and feature selection method on genomic data. [16] added a network regularization term to the log-likelihood function of the Cox proportional hazard model. [17] developed a network-constrained support vector machine algorithm, where the network-based regularization term is added to the objective function of SVM.

The third category of methods involves iterative updates of node importance scores. Frequently used algorithms include network propagation and random walk. [18, 19] adapted Google’s PageRank algorithm to rank genes in a network. Genes are assigned initial ranks $r^{0}$. Then the rank of each gene is updated iteratively depending on the ranks of genes that are linked to it. For gene j, its rank from $r_j^{n-1}$ to $r_j^{n}$ is updated as: $r_j^{n} = (1 - d)\, r_j^{0} + d \sum_{i} \frac{A_{ij}\, r_i^{n-1}}{\deg_i}$, where $\deg_i$ is the degree of the ith gene and d is a fixed parameter. By iterating until convergence a gene will be highly ranked if it is linked to other highly ranked genes. [20] used random walk kernel to smooth gene-wise t-statistics over the network. This is achieved by assigning each node an initial score based on t-test and then multiplying it with the random walk kernel. The p-step random walk kernel is used as a similarity measure to capture the relatedness of two nodes in the network. It is defined as: $K = (\alpha I - L)^{p}$, where L is the normalized graph Laplacian matrix, α is a constant, and p is the number of random walk steps. The network-smoothed t-statistic is used to measure node importance. Similarly, random walk-based scoring of network components is applied in [21] to prioritize functional networks.

Besides algorithmic difference, various biological networks have been employed in network-based feature selection algorithms. A list of these molecular interactions is given in Table 1. In these studies, it was shown that superior features could be selected by the integration of networks. However, recent studies showed that network-based feature selection methods did not significantly improve prediction performance, but mainly contributed to the biological interpretations of the signatures [22-24]. [22] compared 14 feature selection methods, 8 of which integrated network information and 6 of which did not, on 6 breast cancer datasets with respect to their prediction accuracy, signature stability and biological interpretations. The results showed that network-based features in most cases could not improve prediction accuracy significantly. [23, 24] showed that when a correction of feature set size was performed, the stability of network-based features was not higher than single features.
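As an illustration of the second and third categories, the following minimal Python sketch computes a graph Lasso style penalty and runs a PageRank-style rank update on a toy network. It assumes the penalty is λ‖w‖₁ + (1 − λ) Σ A_ij (w_i − w_j)² summed over ordered node pairs, and that each update mixes a gene's initial rank with the degree-normalized ranks of its neighbors. The function names, the default d = 0.85, and the three-node network are hypothetical choices for illustration and are not taken from the cited implementations.

```python
def graph_lasso_penalty(w, A, lam):
    """Sparsity term plus a smoothness term over network edges (assumed form)."""
    n = len(w)
    l1 = sum(abs(x) for x in w)
    # Sum over ordered pairs (i, j); each undirected edge is counted twice.
    smooth = sum(A[i][j] * (w[i] - w[j]) ** 2 for i in range(n) for j in range(n))
    return lam * l1 + (1 - lam) * smooth


def network_rank(A, r0, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate r_j <- (1 - d) * r0_j + d * sum_i A[i][j] * r_i / deg_i."""
    n = len(A)
    deg = [sum(row) for row in A]
    r = list(r0)
    for _ in range(max_iter):
        new_r = []
        for j in range(n):
            spread = sum(A[i][j] * r[i] / deg[i] for i in range(n) if deg[i] > 0)
            new_r.append((1 - d) * r0[j] + d * spread)
        if max(abs(a - b) for a, b in zip(new_r, r)) < tol:
            return new_r
        r = new_r
    return r


# Toy path network 0 - 1 - 2 (hypothetical): node 1 is linked to both others.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
ranks = network_rank(A, [1 / 3, 1 / 3, 1 / 3])
penalty = graph_lasso_penalty([1.0, -1.0, 0.0], A, lam=0.5)
```

In this toy case the middle node ends up with the highest rank because it is linked to both other nodes, which mirrors the statement that a gene is ranked highly when it is linked to other highly ranked genes.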

Table 1

Frequently used molecular/gene interaction networks in network-based feature selection studies.

Molecular interactions	Database	Version	Number of edges	Number of nodes	Studies
protein-protein	STRING	V10.5	547621	19578	[25–27]
protein-protein	HPRD	Release 9	41327	30047	[9–11, 17]
biological pathways	KEGG	Release 84.0	App	App	[15, 28]
biological pathways	Pathway Commons	V7	1912848	14863	[29]
miRNA-gene	miRTarBase	V7.0	502651	16822	[26]
transcription factor—target	TRANSFAC	V7.0 public	1504	1648	[19, 30]
gene co-expression	None	None	Data	Data	[16, 18]
gene functional linkage	Multiple	None	App	App	[8, 16, 31, 32]
gene ontology	Gene ontology	None	GO	GO	[18]

The table lists basic information on the networks as well as exemplary studies that employed them. For the STRING database, we considered only the edges with confidence scores ≥ 0.9. When a database covers many species, only Homo sapiens was considered. In the 4th and 5th columns, Data means that the network size depends on the dimensions of the data, App means that it depends on the application, and GO means that it depends on gene ontology terms.

Regardless of whether network information is integrated, finding robust molecular signatures is a challenging task. Due to the high data dimensionality, it is easy to find a feature subset that fits the training data very well but hard to achieve good generalization. Studies showed that there was hardly any considerable overlap among biomarkers identified in different studies for the same disease [33-35]. Even random feature sets gave comparable prediction performance [36]. The existence of many feature subsets that perform similarly well on the training set makes it difficult to identify the true signatures. Note that the randomness of signatures is also observed when network information is integrated. [23] tested different network-based feature selection algorithms on six breast cancer datasets in prognosis prediction. They showed that the randomization of network structure, which destroyed biological information, did not deteriorate the prediction performance of the selected features. [24] extended the experiments in [23] by comparing more prognosis signatures and observed similar results. We suppose that the main reason for these counter-intuitive results is the curse of dimensionality, where selecting molecular signatures is hard given the limited number of samples. 
In principle, molecular signatures should give better predictions than random features, because biological research has shown that certain genes are more important than others in cancer progression. If we use this information to constrain the feature space and guide feature selection, we could potentially obtain more robust biomarkers. State-of-the-art studies have not utilized this knowledge but have considered the whole feature space and the entire biological network. Because both the data and the network are large, the irrelevant information may overwhelm the signals. Furthermore, biological networks were typically integrated with one type of omics data. It would be very interesting to investigate how the prediction performance differs when the networks are integrated with different omics data types, and additionally what the relationships are among the features selected from different omics data. To address these issues, we proposed a phenotype relevant network-based feature selection (PRNFS) framework. It consists of constructing a gene regulatory network (GRN) specific to the phenotype of interest and selecting features from this network. We demonstrated the superiority of this framework by applying it to lung adenocarcinoma (LUAD) prognosis prediction. We constructed a GRN for EMT, which has been shown to be highly relevant to cancer metastasis and prognosis. On this network, 4 types of omics data (mRNA-Seq, miRNA-Seq, DNA methylation, and copy number alteration data) were integrated and 10 feature selection algorithms were employed. We obtained both single- and multi-omics EMT prognostic signatures, evaluated their prediction performance, analyzed the biological interpretations, and performed survival analysis. Furthermore, these signatures were tested on independent multi-omics LUAD data. We showed that EMT prognostic signatures achieved remarkable prediction performance on TCGA data. 
On independent data, both single- and multi-omics signatures stratified patients into significantly different prognostic groups. Multi-omics signatures were shown to be more robust than single-omics signatures.

Materials and methods

We will first describe the construction of EMT networks. This is followed by the introduction of 10 feature selection algorithms. Then we explain the details of the experiments.

EMT gene regulatory networks

As an up-to-date EMT GRN is not readily available, we constructed the network by literature review. The network incorporates key transcription factors, miRNAs, their regulations, and their interactions with EMT hallmark molecules. Multiple levels of gene regulation, including transcriptional, translational, and post-translational regulation, are covered. The reference for each component in the network can be found in [37] Chapter 2. Since this network covers mainly driver genes, we named it the core network. A visualization of the network using the igraph package [38] is provided in S1 Fig. It has been observed that driver genes are often less differentially expressed than the genes they regulate [35]. If one includes only the driver genes when identifying molecular signatures, one may capture only partial information. We therefore extended this network by including the molecules that directly interact with or are regulated by the molecules in the core network. The NetworkAnalyst tool [39] was employed to find these interactions, which consist of protein-protein interactions, miRNA-gene interactions, and transcription factor-gene interactions. The resulting network was named the extended network. After constructing this network, we noticed that many features have rather low variance among samples; we thus removed these features and obtained the filtered network. All three networks were employed in our experiments. The core, extended, and filtered networks contain 74, 455, and 123 nodes, respectively. Details of the networks can be found in [37].

Experiments

We first obtained mRNA-Seq, miRNA-Seq, DNA methylation, and CNA data of LUAD from the FIREHOSE Broad Genome Data Analysis Center (GDAC) [40]. The GDAC data version is 2016_01_28. mRNA-Seq and miRNA-Seq data were combined because they both measure the abundance of transcripts. This resulted in 3 data levels: gene expression, DNA methylation, and CNA data. These three data levels are abbreviated as GE, DM, and CNA in the remaining text. Each data level was normalized feature-wise by subtracting the mean and dividing by the standard deviation. More details of data pre-processing can be found in [37, 41]. Since we obtained 3 EMT networks and 3 data levels, feature selection can be performed on each combination of network and data level. To evaluate whether EMT-based feature selection can give more robust molecular signatures for prognosis prediction, we employed 10 representative feature selection algorithms to identify signatures from EMT genes and EMT networks. Table 2 gives an overview of these algorithms. Five of these algorithms integrate network information and the other five use only omics data. The underlying methodologies are very different. We suppose that if the EMT network is superior for selecting prognostic signatures, the features selected by the majority of these algorithms should show improved performance. As mentioned before, state-of-the-art studies usually use only gene expression data for feature selection. We instead incorporated three different omics data levels. This allows us to compare and integrate the signatures from different data levels.
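The feature-wise normalization mentioned above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, the population standard deviation is an assumption (the text does not say which estimator was used), and constant features are mapped to zero to avoid division by zero.

```python
from statistics import mean, pstdev


def standardize_features(rows):
    """Z-score each feature (column) across samples (rows)."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    sds = [pstdev(c) for c in columns]  # assumption: population SD
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, sds)]
        for row in rows
    ]


# Three samples, two features; the second feature is constant.
z = standardize_features([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]])
```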

Table 2

Overview of the 10 feature selection algorithms.

We listed below their main methodologies, whether network information was integrated, the algorithmic output and reference.

Algorithm	Methodology	Network	Output	Reference
t-test	t-statistic	No	feature ranking	[28, 42]
Lasso	regularized regression	No	coefficients	[43]
NetLasso	network-based regularization	Yes	coefficients	[15]
AddDA2	subnetwork scoring and searching	Yes	subnetworks	[8, 12]
NetRank	feature importance on network	Yes	feature ranking	[19]
stSVM	random walk on network	Yes	feature ranking	[20]
Cox	Cox PH model	No	feature ranking	[44, 45]
RegCox	regularized Cox PH model	No	coefficients	[46]
MSS	random sampling	No	feature ranking	[47]
Survnet	subnetwork scoring and searching	Yes	subnetworks	[9]

Note that even the largest EMT network (the extended network) covers only 2.3% of the original data dimensions. To assess the performance of EMT-based feature selection, we compared the prediction performance of EMT signatures with the features selected out of all features from the corresponding data levels. Besides, random networks of the same size and structure as EMT networks were generated; the nodes in the random networks were randomly chosen from all the features of the corresponding data level. The features selected from these random networks were compared with EMT signatures. Additionally, we obtained a list of EMT hallmark genes from the Molecular Signatures Database (MSigDB) [48-50] and used these features for predictions. This was to examine whether EMT-based feature selection could outperform conventionally used gene sets. In total, we compared the prediction performance of EMT signatures with the following 5 groups:

1. Random features. These features were selected from random networks using the same feature selection algorithms. 150 random networks were generated for feature selection.
2. All EMT features. We included all features in the EMT networks without feature selection.
3. All features from the corresponding data levels. This corresponds to 19,290 GE features, 20,074 DM features, or 21,456 CNA features.
4. Features selected from all data level features by applying Lasso algorithm.
5. 200 EMT hallmark features from MSigDB.

We performed the comparison by selecting features with the training set, using these features to train an SVM classifier, and classifying samples on the cross-validation set. Patients who survived more than 1400 days belong to the good prognosis group and patients who survived less than 700 days belong to the poor prognosis group. The results from 30 times stratified 10-fold cross-validation were averaged. 
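One way to picture a random network "of the same size and structure" is to keep the edge list of the EMT network and replace every node with a feature drawn at random from the full feature set. The sketch below does exactly that; the function name, the edge-list representation, and the toy gene and feature names are hypothetical, and the authors' actual generation procedure may differ.

```python
import random


def randomize_network(edges, all_features, seed=None):
    """Keep the topology, replace each node by a randomly drawn feature."""
    rng = random.Random(seed)
    nodes = sorted({node for edge in edges for node in edge})
    # Sample without replacement so distinct nodes stay distinct.
    replacements = rng.sample(list(all_features), len(nodes))
    mapping = dict(zip(nodes, replacements))
    return [(mapping[u], mapping[v]) for u, v in edges]


# Toy example: a three-node chain relabeled with features g0..g99.
toy_edges = [("SNAI1", "ZEB1"), ("ZEB1", "CDH1")]
toy_features = ["g%d" % i for i in range(100)]
random_edges = randomize_network(toy_edges, toy_features, seed=1)
```

Repeating this with different seeds would give a collection of random networks on which the same feature selection algorithms can be run.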
Within each data level, the same cross-validation folds were used for all the feature selection algorithms on all the comparative groups. The classification performance was evaluated using three metrics: AUC (the area under the receiver operating characteristic curve), AUPR (the area under the precision-recall curve), and accuracy. To serve a more general audience, we also calculated the average odds ratio for each algorithm using a cut-off of 0.5 for the SVM classifier. We chose relatively stringent survival thresholds for feature selection in order to reveal more differences than similarities between the two patient groups. We argue that it becomes harder to find the signatures if the two groups have more similar samples in terms of the phenotype. For example, if one uses a single threshold of 3 years, we assume that the molecular profiles of patients who survived slightly longer than 3 years may be very similar to those of patients who survived slightly less than 3 years. In this case, it is challenging to find the signatures that can capture the most important difference between the two groups, given the limited number of samples and their heterogeneity. However, we did not ignore the influence of thresholds. We tested the performance of all feature selection algorithms with four different thresholds, in the order of increasing discrepancy: 3 years, <900 or >1200 days, <700 or >1400 days, <500 or >1500 days. Besides the above-mentioned evaluation metrics, survival analysis was performed using the selected features on censored data. The censored data contain many more samples, which could not be included in classification. We think that if the selected features are good signatures, they should be able to stratify the patients into significantly different survival groups. We performed survival analysis on both all-stage and early-stage patients. The sample sizes for classification and for survival analysis (all stage patients) are given in Table 3.
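The thresholding that turns survival times into class labels can be sketched as follows. The handling of censoring is an assumption made for this illustration, since the text does not spell it out: a patient counts as poor prognosis only if death was observed before the lower threshold, and as good prognosis if follow-up exceeded the upper threshold. The function name is hypothetical, and 3 years is taken as 1095 days.

```python
def label_prognosis(days, event_observed, poor_below=700, good_above=1400):
    """Return 'good', 'poor', or None (sample excluded from classification)."""
    if days > good_above:
        return "good"
    if event_observed and days < poor_below:
        return "poor"
    return None  # in the gap between thresholds, or censored too early


# The four threshold settings from the text as (poor_below, good_above) pairs.
THRESHOLDS = [(1095, 1095), (900, 1200), (700, 1400), (500, 1500)]

labels = [label_prognosis(1000, True, lo, hi) for lo, hi in THRESHOLDS]
```

A death at 1000 days is labeled poor only under the single 3-year threshold and is excluded under the three stricter settings, which shows how wider gaps shrink the labeled sample while separating the groups more clearly.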

Table 3

The description of datasets.

This table shows the sample sizes for labeled data (thresholds <700 and >1400 days) and censored data.

	labeled data			censored data
	good prognosis	poor prognosis	total	all stages
Level GE	84	99	183	497
Level DM	74	93	167	447
Level CNA	73	76	149	503

Last but not least, we analyzed the biological interpretations of EMT signatures. Instead of performing gene set enrichment analysis, which could give very significant results simply because of the biological context of the EMT networks, we employed the association rule mining approach [51] to infer prognostic association rules. The rules have the advantage of directly associating the states of the features with the phenotype of interest. We inferred rules using EMT signatures from individual data levels and also from their different combinations. Our motivation is to understand whether features from different data levels complement each other and jointly contribute to patient prognosis. We were able to show that EMT signatures from different data levels complement each other in prognostic rules. This inspired us to obtain multi-omics EMT signatures by combining the signatures on individual data levels (single-omics signatures). Both single- and multi-omics EMT signatures were evaluated on TCGA data and independent LUAD multi-omics data using survival analysis. All the data and code for analysis are available at https://github.com/BorongShao/EMT_prognosis-master.

Results

EMT signatures outperformed comparative groups

First, we show that regardless of the employed feature selection algorithms and evaluation metrics, EMT-based feature selection always outperforms feature selection on random networks. Fig 1 shows the distributions of AUC, AUPR, and accuracy values of EMT signatures and random ones, where DM data and core EMT network were used. S2 Fig shows the same comparative groups using GE data with filtered EMT network. In both cases, the advantages of EMT signatures are very apparent.

Fig 1

The AUC, AUPR, and accuracies of EMT features versus random features using DM data with the core EMT network.

A Gaussian kernel is used to estimate the density functions based on results from 30 times 10-fold cross-validation. For each cross-validation fold, EMT features and random features are tested on the same training and cross-validation samples. Each row in the figure corresponds to one feature selection algorithm. The last row corresponds to using all EMT features. The p-values of paired t-tests are provided in each sub-figure.

Next, we give the average AUC values of EMT signatures on all three data levels in Table 4. The boxplot of AUC values is given in S3 Fig. The average odds ratios of different algorithms are given in Table 5. These results show that features selected from GE and DM data obtained better prediction performance than features selected from CNA data. Because the ranking of the algorithms varies with the data level and network size, it is hard to identify a single best-performing feature selection algorithm. In the last three rows of each table we give the results of comparative groups 3, 4, and 5. This shows that EMT signatures in many cases outperformed EMT hallmark features and the features selected from all data level features. For example, with the Lasso feature selection algorithm, which was applied both in the EMT feature space and in the whole feature space, EMT signatures gave better predictions in more than half of the cases. This indicates that selecting prognostic signatures from a much smaller phenotype relevant network is a feasible approach. We also evaluated the performance of the 10 feature selection algorithms with different classification thresholds. Both SVM and random forest classifiers were employed. The results are given in S4 Fig. They show that regardless of the feature selection algorithm, using more discrepant thresholds tends to yield higher AUC values. Meanwhile, a few algorithms such as addDA2, RegCox, and Survnet are more sensitive to the effect of thresholds than the other algorithms.

Table 4

The prediction performance of EMT signatures on three data levels.

The table gives the average AUC values of EMT signatures on three data levels with each EMT network. The results from comparative groups 2, 3, 4, and 5 are given in the third row and the last three rows.

Data Level	Gene expression			DNA Methylation			CNA
|V(G)|	74	123	455	74	123	455	70	117	445
EMT	0.662	0.728	0.691	0.698	0.679	0.671	0.616	0.645	0.608
t-test	0.658	0.709	0.677	0.688	0.675	0.669	0.616	0.626	0.621
Lasso	0.616	0.703	0.620	0.697	0.666	0.667	0.615	0.619	0.617
NetLasso	0.659	0.718	0.686	0.700	0.678	0.677	0.619	0.635	0.621
addDA2	0.650	0.675	0.651	0.699	0.661	0.702	0.597	0.626	0.616
NetRank	0.656	0.691	0.668	0.695	0.685	0.693	0.615	0.619	0.610
stSVM	0.651	0.693	0.639	0.669	0.668	0.687	0.608	0.617	0.616
Cox	0.673	0.705	0.712	0.703	0.707	0.696	0.620	0.664	0.675
RegCox	0.648	0.698	0.729	0.696	0.717	0.666	0.645	0.669	0.653
MSS	0.662	0.694	0.659	0.674	0.654	0.640	0.608	0.627	0.625
Survnet	0.646	0.661	0.679	0.702	0.688	0.680	0.626	0.693	0.682
All	0.648			0.652			0.612
All + Lasso	0.643			0.691			0.607
EMT hallmark	0.675			0.627			0.617

Table 5

The average odds ratios of EMT signatures on three data levels.

The table gives the average odds ratios of EMT signatures on three data levels with each EMT network. The results from comparative groups 2, 3, 4, and 5 are given in the third row and the last three rows.

Data Level	Gene expression			DNA Methylation			CNA
|V(G)|	74	123	455	74	123	455	70	117	445
EMT	3.162	7.021	5.563	4.27	4.263	4.223	1.779	2.766	1.257
t-test	3.974	6.112	8.872	4.767	6.044	8.31	3.086	3.663	4.745
Lasso	5.094	5.6	10.793	4.482	8.213	10.427	4.367	4.494	5.852
NetLasso	3.742	6.418	9.042	4.607	4.772	9.198	2.291	3.211	2.975
addDA2	5.304	4.714	10.716	5.432	6.736	15.647	2.639	5.201	7.297
NetRank	3.881	4.75	6.776	5.302	5.621	7.662	2.731	3.715	2.79
stSVM	4.118	6.238	2.448	3.682	4.22	5.253	2.173	1.63	1.721
Cox	3.685	5.684	7.666	5.083	5.499	4.821	1.398	3.562	4.897
RegCox	3.944	5.924	8.558	5.225	6.576	4.625	3.525	4.165	3.712
MSS	3.098	6.005	4.284	4.516	3.187	2.76	1.421	2.515	1.698
Survnet	2.798	4.159	4.798	5.455	3.694	4.544	3.25	5.394	5.457
All	4.316			2.221			1.233
All + Lasso	3.537			5.385			2.451
EMT hallmark	5.103			1.322			1.227

Frequently selected features further improve predictions

Although EMT signatures were shown to be significantly predictive in the experiments above, we observed high variance in the AUC values from individual cross-validation tests (shown in S3 Fig). Some partitions of the data into training and cross-validation sets led to good predictions and some led to poor predictions. Even on the small EMT feature space, this phenomenon is frequently observed. This suggests that selecting molecular signatures based on a single cross-validation test or a single division of samples into training and testing sets is highly unreliable. We think that sample heterogeneity contributed to the high variance in prediction performance. Thus, we addressed this issue by employing the frequently selected features (FSFs) from all 30 times 10-fold cross-validation feature selection runs. Instead of using the 20 features selected from each training set, we used the top 20 FSFs and tested their performance using the same evaluation approach. DM data and the extended EMT network were employed for the test, as this combination was shown in Table 4 to give above-average prediction performance. We compared the prediction performance of FSFs with that of individually selected features. The results are given in Fig 2. The density plots and results of statistical tests are given in S5 Fig.
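Collecting FSFs amounts to counting how often each feature is selected across all cross-validation training sets and keeping the most frequent ones. A minimal sketch, assuming each run yields a list of selected feature names; the function name and the alphabetical tie-breaking are illustrative assumptions, not details given in the text:

```python
from collections import Counter


def frequently_selected_features(selections, top_k=20):
    """Rank features by how many selection runs picked them."""
    counts = Counter(f for selected in selections for f in set(selected))
    # Sort by descending count, then by name so ties resolve deterministically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [feature for feature, _ in ranked[:top_k]]


# Toy example with three runs (hypothetical feature names).
runs = [["a", "b"], ["a", "c"], ["a", "b"]]
top_two = frequently_selected_features(runs, top_k=2)
```

With 30 repetitions of 10-fold cross-validation there would be 300 such lists per algorithm, and `top_k=20` matches the signature size used in the text.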

Fig 2

The comparison of prediction performance between FSFs and individually selected features for different feature selection algorithms.

The boxplot is based on the results from 30 times stratified 10-fold cross-validation.

We observed that FSFs significantly outperformed individually selected features, except for the NetRank algorithm. The average AUC values of the t-test, Lasso, NetLasso, and addDA2 feature selection algorithms were 0.773, 0.825, 0.796, and 0.833, respectively. Correspondingly, the average odds ratios increased to 8.115, 12.631, 11.104, and 13.139. This shows that using FSFs can mitigate the effect of sample heterogeneity. Recall that we used only <2.5% of the original dimensionality, namely EMT features, for feature selection and prognosis prediction. The remarkable results are consistent with the biological knowledge that the EMT process is highly relevant to cancer prognosis [52-56].

Biological interpretations

After identifying EMT FSFs, we further investigated their biological interpretations, especially the relationships among FSFs from different omics data levels. We employed the association rule mining approach, which was originally defined as follows [57]: Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of n binary features called items. Let $D = \{t_1, t_2, \ldots, t_m\}$ be a set of transactions called the database. A rule is defined in the form: X ⇒ Y, where X, Y ⊆ I. The itemsets X and Y are called left-hand-side (LHS) and right-hand-side (RHS). In order to select interesting rules from the set of all possible rules, constraints on various measures of significance and interest are applied. Let a rule X ⇒ Y be identified on a set of transactions T. Commonly used constraints are given below: Support. It indicates how frequently the itemset appears in T. Confidence. It indicates how often a rule has been found to be true. Lift. It indicates the degree to which X and Y depend on each other. We first discretized the EMT features to binary values using the mean of each feature. Then we applied the Apriori algorithm [58] to derive rules, with the constraints of confidence ≥ 0.8 and support ≥ 0.1. The algorithm was implemented in the arules R package [51]. Since we are trying to find molecular patterns for predicting prognosis, we set the RHS of the rules to be the class labels of prognosis. The resulting rules show sound biological interpretations according to established findings in cancer research [59, 60]. Here we interpret two rules identified from the core EMT network: {LOXL2 = high, TGFB1 = high, miR-34a = low} ⇒ {prognosis = poor}, with support = 0.135, confidence = 1, lift = 2.046. This rule applies to all samples that have these 3 gene expression conditions. Biologically, it has been shown that LOXL2 can stabilize SNAI1. TGFB1 can phosphorylate SMAD2 and SMAD3, which interact with SMAD4 to activate HMGA2, which then activates SNAI1. 
When LOXL2 and TGFB1 are highly expressed, it not only induces SNAI1 gene expression but also stabilizes SNAI1 protein. miR-34a has the role of repressing SNAI1. When miR-34a has low gene expression, SNAI1 is less repressed. Taken together, these three conditions point to the direction of the high expression of SNAI1—a master transcription factor to induce EMT. This contributes to poor prognosis. In contrast, another rule which has an opposite LOXL2 state indicates good prognosis: {LOXL2 = low, ETS1 = low, LOXL2 = high} ⇒ prognosis = good, with support = 0.105, confidence = 1, lift = 1.956. In this scenario, LOXL2 has high DNA methylation level and low gene expression level, and thus not able to stabilize SNAI1. ETS1 gene is known to increase the expression of ZEB1 which induces EMT. In this rule ETS1 has low expression so it does not contribute to inducing EMT. These factors can contribute to good prognosis. S1 Table contains more examples.
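The three rule measures can be computed directly from mean-binarized features. The sketch below is an illustration only: it evaluates a single given rule and does not implement the Apriori search or the arules package, and the feature names (`geneA`, `geneB`) and all values are made up for the example.

```python
def binarize_by_mean(columns):
    """Map each feature column to 'high'/'low' relative to its own mean."""
    states = {}
    for name, values in columns.items():
        mean = sum(values) / len(values)
        states[name] = ["high" if v > mean else "low" for v in values]
    return states

def rule_metrics(transactions, lhs, rhs):
    """Support, confidence and lift of the rule lhs => rhs.

    transactions: list of dicts (item name -> state); lhs, rhs: dicts of
    required states. Assumes at least one transaction matches lhs and rhs.
    """
    n = len(transactions)
    matches = lambda t, cond: all(t.get(k) == v for k, v in cond.items())
    n_lhs = sum(matches(t, lhs) for t in transactions)
    n_rhs = sum(matches(t, rhs) for t in transactions)
    n_both = sum(matches(t, lhs) and matches(t, rhs) for t in transactions)
    support = n_both / n                 # supp(X u Y)
    confidence = n_both / n_lhs          # supp(X u Y) / supp(X)
    lift = confidence / (n_rhs / n)      # conf(X => Y) / supp(Y)
    return support, confidence, lift

# Made-up toy values for two hypothetical features and a prognosis label.
genes = {
    "geneA": [9, 8, 7, 2, 1, 2, 1, 3, 8, 2],
    "geneB": [1, 2, 1, 8, 9, 7, 8, 2, 9, 1],
}
prognosis = ["poor", "poor", "poor", "good", "good",
             "good", "good", "poor", "good", "poor"]

states = binarize_by_mean(genes)
transactions = [
    {"geneA": states["geneA"][i], "geneB": states["geneB"][i], "prognosis": prognosis[i]}
    for i in range(len(prognosis))
]
support, confidence, lift = rule_metrics(
    transactions, {"geneA": "high", "geneB": "low"}, {"prognosis": "poor"}
)
print(support, confidence, lift)
```

Under these definitions, a rule with confidence 1 has lift equal to 1 / supp(Y). The two reported rules therefore imply class frequencies of about 1 / 2.046 ≈ 0.489 (poor) and 1 / 1.956 ≈ 0.511 (good), which sum to 1 as expected for two complementary prognosis classes.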

From single- to multi-omics signatures

The FSFs above were each obtained from a single data level. Therefore, we refer to them as single-omics signatures. To investigate whether molecular signatures incorporating multiple data levels can be superior, we combined single-omics signatures into multi-omics signatures and compared their capabilities in stratifying samples into different prognostic groups. Using these signatures, we clustered the samples into 3 groups with both the k-means and spectral clustering algorithms. Survival analysis was performed on the resulting clusters by estimating Kaplan-Meier survival curves and conducting log-rank tests. The test results based on the k-means algorithm are given in Table 6, where the columns show different data level combinations and the rows correspond to feature selection algorithms. Two comparative groups, using all EMT features and using EMT hallmark gene sets, are included. The test results based on the spectral clustering algorithm are given in S2 Table. In both tables we observe that multi-omics signatures can improve sample stratifications significantly; in Table 6, for example, the addDA2 GE+DM signature reaches p = 8.99e-18, compared with 5.20e-10 for GE and 5.17e-09 for DM alone. An example is visualized in S6, S7 and S8 Figs.
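A log-rank comparison of three sample clusters can be sketched in plain Python as follows. This is an illustrative implementation of the standard multi-group log-rank statistic, restricted to exactly three groups, and is not the software used in the study. With three groups the statistic has 2 degrees of freedom, for which the chi-square tail probability is exp(-x/2). The cluster labels would come from k-means or spectral clustering; here they are simply given, and the toy data are made up.

```python
import math

def logrank_three_groups(times, events, groups):
    """Log-rank test for exactly three groups (labels 0, 1, 2).

    times: follow-up times; events: 1 = event observed, 0 = censored.
    Returns (chi2 statistic, p-value) with 2 degrees of freedom.
    Assumes every group is non-empty.
    """
    u = [0.0, 0.0]                    # observed minus expected events, groups 0 and 1
    v = [[0.0, 0.0], [0.0, 0.0]]      # covariance matrix of u
    for t in sorted({ti for ti, e in zip(times, events) if e == 1}):
        at_risk = [0, 0, 0]
        died = [0, 0, 0]
        for ti, e, g in zip(times, events, groups):
            if ti >= t:
                at_risk[g] += 1
                if ti == t and e == 1:
                    died[g] += 1
        n, d = sum(at_risk), sum(died)
        if n < 2:
            continue                  # a single subject at risk carries no information
        scale = d * (n - d) / (n - 1)
        for j in range(2):
            u[j] += died[j] - d * at_risk[j] / n
            for l in range(2):
                delta = 1.0 if j == l else 0.0
                v[j][l] += scale * (at_risk[j] / n) * (delta - at_risk[l] / n)
    # Quadratic form u' V^-1 u using the explicit inverse of the 2x2 matrix.
    det = v[0][0] * v[1][1] - v[0][1] * v[1][0]
    chi2 = (u[0] * (v[1][1] * u[0] - v[0][1] * u[1])
            + u[1] * (v[0][0] * u[1] - v[1][0] * u[0])) / det
    return chi2, math.exp(-chi2 / 2.0)

# Made-up toy data: three clusters with clearly separated survival times (months).
times = [2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 20, 21, 22, 23, 24]
events = [1] * 10 + [1, 1, 0, 1, 0]   # two censored samples in the last cluster
groups = [0] * 5 + [1] * 5 + [2] * 5
chi2, p_value = logrank_three_groups(times, events, groups)
print(chi2, p_value)
```

Each cell of Tables 6 and 7 corresponds to one such p-value, computed for the three clusters obtained from a given signature.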

Table 6

The results of log-rank tests on stratified sample clusters using single- and multi-omics EMT signatures on all-stage samples.

K-means algorithm was employed for clustering the samples into 3 groups. We highlighted all p-values that are lower than 10e-3.

	GE	DM	CNA	GE+DM	GE+CNA	DM+CNA	GE+DM+CNA
t-test	7.87e-06	1.12e-01	3.62e-03	7.55e-06	8.75e-04	1.90e-03	8.25e-06
Lasso	4.58e-06	5.56e-02	5.54e-01	1.66e-04	2.28e-07	8.71e-01	1.18e-04
NetLasso	2.71e-01	6.45e-01	7.58e-02	2.96e-02	1.53e-02	4.83e-01	4.34e-01
addDA2	5.20e-10	5.17e-09	1.24e-04	8.99e-18	3.75e-05	1.11e-09	1.35e-07
NetRank	1.13e-07	1.79e-01	2.19e-02	2.50e-07	6.97e-06	7.31e-02	3.28e-06
stSVM	4.14e-02	3.39e-01	8.91e-01	8.86e-01	6.37e-02	5.85e-01	6.83e-01
Cox	2.55e-09	2.91e-03	5.30e-07	3.12e-04	1.70e-06	1.13e-04	6.11e-06
RegCox	1.78e-07	8.52e-03	2.67e-01	2.36e-09	1.52e-10	1.81e-07	2.52e-07
MSS	1.48e-03	5.95e-01	2.78e-01	6.29e-05	2.28e-04	2.59e-01	1.63e-03
Survnet	7.36e-05	6.25e-03	5.19e-03	2.59e-06	1.54e-03	3.77e-05	2.19e-05
Ensemble	2.32e-09	4.72e-02	6.05e-03	1.20e-04	1.62e-05	1.01e-01	1.18e-04
allemt	1.39e-02	4.30e-01	1.07e-01	7.64e-01	1.45e-02	2.07e-01	5.34e-01
EMT hallmark	1.62e-01	9.47e-01	3.09e-01	7.59e-01	5.82e-02	8.96e-01	7.58e-01

Next, we performed survival analysis on early-stage patients. The results are given in Table 7. They show that EMT-based signatures can still stratify the patients into significantly different prognostic groups.

Table 7

The results of log-rank tests on stratified sample clusters using single- and multi-omics EMT signatures on early-stage samples.

K-means algorithm was employed for clustering the samples into 3 groups. We highlighted all p-values that are lower than 10e-2.

	GE	DM	CNA	GE+DM	GE+CNA	DM+CNA	GE+DM+CNA
t-test	6.28e-03	9.38e-02	1.93e-01	4.78e-04	8.40e-03	1.55e-01	2.47e-01
Lasso	1.82e-04	1.20e-03	1.01e-01	2.35e-01	1.67e-06	2.51e-03	4.94e-03
NetLasso	7.95e-03	8.56e-01	2.46e-01	2.29e-01	1.01e-01	9.69e-01	9.31e-01
addDA2	2.53e-04	1.98e-05	1.63e-03	6.88e-08	3.52e-02	1.03e-05	8.51e-04
NetRank	9.31e-06	5.39e-01	4.08e-03	3.52e-03	5.35e-04	7.54e-03	8.57e-04
stSVM	3.17e-02	2.99e-01	2.86e-01	4.00e-01	2.16e-02	1.32e-01	8.57e-01
Cox	4.40e-04	2.15e-01	2.42e-02	1.85e-02	3.30e-02	1.10e-02	6.43e-04
RegCox	8.52e-04	3.36e-01	2.33e-02	8.58e-03	2.18e-05	2.03e-03	3.90e-02
MSS	6.51e-02	6.51e-01	2.43e-01	2.34e-02	3.10e-02	9.91e-01	5.91e-02
Survnet	4.16e-03	3.05e-01	6.24e-02	6.78e-02	8.66e-02	2.27e-02	8.08e-03
Ensemble	4.03e-04	1.54e-01	4.54e-02	5.06e-03	4.01e-04	5.79e-04	7.95e-04
allemt	2.59e-01	9.45e-01	7.04e-03	6.31e-01	1.38e-02	4.16e-01	9.73e-01

Last but not least, we tested the performance of two integrative clustering algorithms, SNF [61] and iCluster [62], with multi-omics EMT signatures. Briefly, the SNF algorithm constructs a sample similarity network for each data level and then fuses these networks into a single similarity network, on which spectral clustering is used to determine the sample clusters. iCluster employs a joint latent variable model to connect the different data levels; the latent component is used to determine the sample clusters. Based on the clustering results we performed survival analysis. The results of the log-rank tests are given in S3 Table for the SNF algorithm and in S4 Table for the iCluster algorithm. We observed that neither SNF nor iCluster yielded better sample stratifications than the k-means algorithm (Table 6).

Test results on independent data

We obtained the test data from [63], comprising 164 samples with DM data, 121 of which also have mRNA expression (microarray) data available. The patient follow-up time ranges from 2 to 99 months, with a median of 44 months. The outcome (event) is defined as the occurrence of relapse, distant metastasis or death. The time to event is calculated from the date of surgery. Detailed experimental procedures and the processing of raw data are provided in [63]. EMT single- and multi-omics signatures consisting of GE and DM data levels were evaluated on the test data using survival analysis. EMT signatures were extracted from the test data without any additional training or modification. Hierarchical clustering, instead of k-means, was employed in order to compare our results with the original study [63]. We tested the EMT signatures selected by each feature selection algorithm [37]. The results show that single-omics signatures can already stratify the samples into significantly different prognostic groups. An example is given in Fig 3. Multi-omics signatures often yielded better sample stratifications. Fig 4 shows an example where the multi-omics signature from a feature selection algorithm can significantly stratify the samples while the single-omics signatures cannot. Compared with the survival analysis results in the original study [63], we achieved more significant sample stratifications with EMT signatures.
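The survival curves compared in this analysis are Kaplan-Meier estimates. A minimal sketch of the estimator is given below, using the event definition stated above (relapse, distant metastasis or death, with time measured from surgery). The function name and the toy follow-up data are assumptions made for this illustration and are not taken from the test cohort.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times: months from surgery to the event or to the last follow-up;
    events: 1 if relapse, distant metastasis or death occurred, 0 if censored.
    """
    curve = []
    surv = 1.0
    for t in sorted({ti for ti, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for ti in times if ti >= t)
        died = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        # Multiply by the conditional probability of surviving past time t.
        surv *= 1.0 - died / at_risk
        curve.append((t, surv))
    return curve

# Made-up follow-up data for six patients (months, event indicator).
toy_times = [2, 5, 5, 8, 12, 20]
toy_events = [1, 1, 0, 1, 0, 1]
km_curve = kaplan_meier(toy_times, toy_events)
for t, s in km_curve:
    print(t, round(s, 4))
```

Censored patients (event indicator 0) leave the risk set at their last follow-up time without lowering the curve, which is why the estimate drops only at observed event times.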

Fig 3

EMT single-omics signatures can stratify test samples into significantly different prognostic groups.

The signature is selected by addDA2 algorithm using DM data.

Fig 4

EMT multi-omics signatures can stratify test samples into significantly different prognostic groups, when the corresponding single-omics signatures cannot.

The signature consists of both GE and DM single-omics signatures selected by t-test.

Discussion

Various feature selection algorithms have been proposed to identify biomarkers from omics data for predicting a phenotype of interest. Although more and more information, such as biological networks and multiple types of omics data, has been integrated into feature selection, recent studies report low reproducibility of molecular signatures [22, 24, 35]. Some attribute this to the existence of a large number of genes that are correlated with the target labels [12]. Given the limited number of samples, it becomes very hard to distinguish marker genes from irrelevant genes. We addressed this issue by constructing a phenotype-relevant gene regulatory network (GRN) and integrating multiple types of omics data with the network to select molecular signatures. We have shown that, for lung cancer prognosis prediction, EMT signatures selected from less than 2.5% of the original feature space outperformed classical feature selection on the whole feature space. To the best of our knowledge, this is the first time a phenotype-relevant GRN has been constructed for lung cancer prognosis prediction.

Previously we employed EMT networks for selecting lung cancer prognostic signatures [41, 64]. However, [64] used mRNA expression and miRNA expression data only, and [41] employed three data levels for feature selection but obtained no significant improvement in predictions. In this study, we extended the EMT network to incorporate its interacting molecules. In addition, we reviewed the network used in [41] and removed the edges that denote associations rather than direct gene regulation. This study also differs from our previous work in employing 10 representative feature selection algorithms instead of decomposing the network into network motifs [41, 64]. We selected EMT signatures on three data levels with different network sizes, compared them with features selected from the whole data dimensions, and derived prognostic rules from the EMT signatures. Furthermore, we obtained multi-omics signatures and showed their superior prediction performance over single-omics signatures. This shows that signatures from multiple omics data types can complement each other to better distinguish different phenotypes.

The potential of EMT molecules in prognosis prediction has also been studied before. [65] and [66] performed survival analysis using individual EMT hallmark molecules such as E-cadherin and vimentin and showed that none of these molecules could separate LUAD or bladder cancer patients into significantly different prognostic groups. Note that these conclusions were drawn mainly from univariate analysis. Since the molecules jointly contribute to the phenotype, it could be more helpful to use a set of features. This can also be seen from the prognostic association rules derived from EMT signatures, where EMT molecules are jointly associated with the phenotype.

All in all, we demonstrated that EMT network-based feature selection and data integration can provide advantages in selecting cancer prognostic signatures. It would be interesting to investigate further whether this approach can be used to identify robust molecular signatures for other phenotypes and diseases. This could potentially improve our understanding of diseases at the molecular level and help develop more individualized medicine.

S1 Fig

Core EMT network.

The names of genes and miRNAs are given on the nodes. (TIF)

S2 Fig

The AUC, AUPR, and accuracies of EMT features versus random features using gene expression data with filtered EMT network.

Gaussian kernel is used to estimate the density functions based on results from 30 times 10-fold cross-validation. For each cross-validation fold, EMT features and random features are tested on the same training and testing samples. The comparisons on five feature selection algorithms together with the comparative group of using all EMT features are shown. The p-values of paired t-tests are provided. (TIF)

S3 Fig

The AUC values of 10 feature selection algorithms.

The three panels correspond to three data levels. Within each panel, the AUC values of the 10 algorithms are plotted. Each algorithm has three boxes of different colors denoting the 3 EMT networks. The blue and red dotted lines within each panel are the median AUC values of two comparative groups: 1) using all data level features and 2) Lasso feature selection on all data level features. (TIF)

S4 Fig

The AUC values of 10 feature selection algorithms using different thresholds for classification.

The data level is DNA methylation data. The network is EMT core network. (TIF)

S5 Fig

The comparison of FSFs with individually selected features in terms of AUC, AUPR, and accuracy values.

We used DNA methylation data and extended EMT network for feature selection and SVM classifier for classification. Gaussian kernel is used to estimate the density functions based on results from 30 times stratified 10-fold cross-validation. For each cross-validation iteration, individually selected features and FSFs are tested on the same training and testing samples. The comparison between the two feature groups is shown on five feature selection algorithms together with the p-values of paired t-tests. (TIF)

S6 Fig

Patient stratification using GE features from addDA2 algorithm.

(TIF)

S7 Fig

Patient stratification using DM features from addDA2 algorithm.

(TIF)

S8 Fig

Patient stratification using GE and DM features from addDA2 algorithm.

(TIF)

S1 Table

Top 20 prognostic association rules derived from the FSFs using filtered EMT network.

All the following rules have confidence scores of 1. (PDF)

S2 Table

The p-values of log-rank tests based on the clustering of spectral clustering algorithm for different data level combinations using extended EMT network.

We highlighted all p-values that are lower than 10e-5. (PDF)

S3 Table

The p-values of log-rank tests based on SNF clustering using different data level combinations with extended EMT network.

We highlighted all p-values that are lower than 10e-5. (PDF)

S4 Table

The p-values of log-rank tests based on iCluster clustering using different data level combinations with extended EMT network.

We highlighted all p-values that are lower than 10e-5. (PDF)
