Jeaneth Machicao1, Francesco Craighero1, Davide Maspero1, Fabrizio Angaroni1, Chiara Damiani1, Alex Graudenzi1, Marco Antoniotti1, Odemir M Bruno1. 1. 1São Carlos Institute of Physics, University of São Paulo, São Carlos, Brazil; 2School of Engineering, University of São Paulo, São Paulo, Brazil; 3Department of Informatics, Systems and Communication, University of Milan-Bicocca, Milan, Italy; 4Institute of Molecular Bioimaging and Physiology, Consiglio Nazionale delle Ricerche (IBFM-CNR), Segrate, Milan, Italy; 5Department of Biotechnology and Biosciences, University of Milan-Bicocca, Milan, Italy; 6Sysbio Centre for Systems Biology, Milan, Italy; 7Bicocca Bioinformatics, Biostatistics and Bioimaging Center (B4), University of Milan-Bicocca, Milan, Italy.
Abstract
BACKGROUND: The increasing availability of omics data collected from patients affected by severe pathologies, such as cancer, is fostering the development of data science methods for their analysis. INTRODUCTION: The combination of data integration and machine learning approaches can provide new powerful instruments to tackle the complexity of cancer development and deliver effective diagnostic and prognostic strategies. METHODS: We explore the possibility of exploiting the topological properties of sample-specific metabolic networks as features in a supervised classification task. Such networks are obtained by projecting transcriptomic data from RNA-seq experiments on genome-wide metabolic models to define weighted networks modeling the overall metabolic activity of a given sample. RESULTS: We show the classification results on a labeled breast cancer dataset from the TCGA database, including 210 samples (cancer vs. normal). In particular, we investigate how the performance is affected by a threshold-based pruning of the networks by comparing Artificial Neural Networks, Support Vector Machines and Random Forests. Interestingly, the best classification performance is achieved within a small threshold range for all methods, suggesting that it might represent an effective choice to recover useful information while filtering out noise from data. Overall, the best accuracy is achieved with SVMs, which exhibit performances similar to those obtained when gene expression profiles are used as features. CONCLUSION: These findings demonstrate that the topological properties of sample-specific metabolic networks are effective in classifying cancer and normal samples, suggesting that useful information can be extracted from a relatively limited number of features.
BACKGROUND: The increasing availability of omics data collected from patients affected by severe pathologies, such as cancer, is fostering the development of data science methods for their analysis. INTRODUCTION: The combination of data integration and machine learning approaches can provide new powerful instruments to tackle the complexity of cancer development and deliver effective diagnostic and prognostic strategies. METHODS: We explore the possibility of exploiting the topological properties of sample-specific metabolic networks as features in a supervised classification task. Such networks are obtained by projecting transcriptomic data from RNA-seq experiments on genome-wide metabolic models to define weighted networks modeling the overall metabolic activity of a given sample. RESULTS: We show the classification results on a labeled breast cancer dataset from the TCGA database, including 210 samples (cancer vs. normal). In particular, we investigate how the performance is affected by a threshold-based pruning of the networks by comparing Artificial Neural Networks, Support Vector Machines and Random Forests. Interestingly, the best classification performance is achieved within a small threshold range for all methods, suggesting that it might represent an effective choice to recover useful information while filtering out noise from data. Overall, the best accuracy is achieved with SVMs, which exhibit performances similar to those obtained when gene expression profiles are used as features. CONCLUSION: These findings demonstrate that the topological properties of sample-specific metabolic networks are effective in classifying cancer and normal samples, suggesting that useful information can be extracted from a relatively limited number of features.
The development of automated strategies for the classification of cancer samples in distinct categories (e.g., subtypes, risk groups, etc.) is one of the key challenges in current biosciences [1]. On the one hand, this might lead to the discovery of efficient, personalized diagnostic, prognostic, and therapeutic strategies for cancer patients. On the otherhand, it could allow unraveling some of the still undeciphered mechanisms and processes underlying cancer development, leading to a data-driven understanding of the disease.It is known that effective classification and clustering of cancer samples can be achieved by employing the information on expression data [2-7], genomic alteration profiles [8, 9], interaction networks [10], and even signaling pathways [11, 12]. In this work, however, we specifically focus on the metabolic properties that may distinguish cancer from normal samples. In fact, metabolic deregulation is one of the key hallmarks of cancer [13-15], even if its underlying mechanisms are still partially unknown. In this respect, in recent years, an increasing number of computational strategies have been devised, in order to take advantage of the growing availability and reliability of -omics data to investigate the alterations of metabolism in cancer [16-19]. Very often, such data have been employed in constraint-based models, such as Flux Balance Analysis (FBA), in which metabolic fluxes are simulated to compare different experimental scenarios [20-24].Moreover, more recently, approaches coupling constraint-based metabolic modeling with supervised machine learning algorithms have been proposed [25]. In our case, we explore for the first time the possibility of employing the topological properties of metabolic networks as input features of classification algorithms. To this end, we rely on an approach firstly introduced in [26,27] in which transcriptomic data, such as RNA-seq, are employed to determine the approximate activity value of the reactions included in a given metabolic network.More in detail, by introducing a relevance threshold on the metabolic activity level, we pruned the original metabolic network to define individual-specific networks in which only the significantly active reactions are preserved. The topological properties of such individual-specific networks are then used as features to perform a supervised classification task via various algorithmic strategies and, in particular, Multi-Layer Perceptrons (MLPs), Support Vector Machines (SVMs) and Random Forests (RFs).To investigate our hypothesis, this work presents the classification results in a simple scenario in which the sample categories are known a priori - cancer vs. normal - concerning the TCGA-BRCA breast cancer dataset [28], which includes 210 total samples.We show that noteworthy classification performance can be achieved by using a few key topological properties of metabolic networks, i.e., average degree, average hierarchical degree, average geodesic path length and assortativity. Interestingly, a similar pruning threshold (in the range 0.01 – 0.1) is identified as optimal for all tested machine learning strategies, suggesting that it could be an effective choice to extract useful information from the “relevant” activity of metabolic networks, while discarding possible artifacts due to noisy observations. Overall, the best classification performance is obtained with SVMs and threshold 0.1, which exhibit 0.866 of (average) accuracy, 0.86 precision and 0.879 recall on the test set, after k-fold cross-validation and hyper-parameter estimation. Furthermore, we show that the best performing SVM classifier (with the optimal threshold) delivers similar classification performance with respect to an analogous classifier processing a reduced gene expression feature vector, as computed by selecting the 5 principal components on the list of 1673 metabolic genes from Recon2.2 [29].These results prove that the projection of transcriptomic activity on metabolic networks provides useful information to efficiently classify cancer samples and might pave the way for the development of strategies for experimental hypothesis generation.
Materials and Methods
Integration of RNA-seq and Metabolic Networks
As proposed earlier [26, 27], it is possible to project transcriptomic data onto human metabolic networks [30], to derive an approximate activity value for each metabolic reaction in any given sample.We first employ an input metabolic network such as the Human Metabolic Reaction (HMR) [31] or Recon [29, 32]. is a bipartite-directed graph that includes two kinds of nodes: (i) metabolites (i.e., substrates or products), and (ii) metabolic reactions. The edges in connect either: (i) the substrates and the relative reaction, or (ii) a reaction and the relative products. The total number of nodes of is N, whereas the total number of edges is E. Reaction nodes are associated with Gene-Protein-Reaction (GPR) rules, i.e., logical formulas that describe the related catalyses via AND and OR logical operators. In particular, AND rules are employed when distinct genes encode different subunits of the same enzyme, whereas OR rules are used when distinct genes encode isoforms of the same enzyme.RNA-seq data are then used to provide an approximate activity value to each reaction in the input network. In particular, our method takes as input a n (genes) × m (samples) matrix T in which each element , , , includes the transcript level of gene g in sample s (the Reads per Kilobase per Million mapped reads – RPKM).For each reaction in the input network and for each sample , we define a Reaction Activity Score (RAS), by distinguishing two cases.Reactions with GPR including an AND operator,(1)where is the set of genes that encode the subunits of the enzyme catalyzing reaction r.Reactions with GPR including an OR operator,(2)where is the set of genes that encode isoforms of the enzyme that catalyzes reaction r.In case of composite reactions, we respect the standard precedence of the two operators. The rationale underlying the definition of the RAS is that enzyme isoforms (OR) contribute additively to the overall activity of a certain reaction, whereas enzyme subunits (AND) limit its activity. RASs are finally normalized to obtain values in the range [0, 1] (with 0 meaning no activity and 1 meaning maximum activity observed in the dataset).Even though this simplified approach neglects the heterogeneity of reaction kinetic constants, protein binding affinities and translation rates, it was proven effective in the investigation of cancer metabolic deregulation and in cancer sample stratification [26, 27].
Cancer Sample Classification via Metabolic Network Pruning
We define the sample-specific metabolic network of a given sample s as the weighted adjacency matrix , which contains elements, such that each element is equal to: (i) if i is a substrate of reaction j, (ii) if i is a reaction and j one of its products, (iii) 0 otherwise.Since we are interested in exploiting the topological properties of the “giant component” of the sample-specific metabolic network (as proposed, e.g., in [33]), we employ a network pruning procedure to select the relevant metabolic reactions. This threshold criterion was employed earlier [34-36]. In detail, a threshold parameter is used to obtain an unweighted and thresholded adjacency matrix, the elements of which are defined as follows:(3)It must be noted that we have focused on the larger than option, because we can hypothesize that only significantly active reactions (above the threshold) are responsible for the phenotypic/functional properties of cells. By scanning different values of the threshold, we can then evaluate the impact on the performance of classifiers that take as input certain topological measurements of the resulting giant component (see below), thus identifying an optimal threshold value.Clearly the threshold parameter determines the size of the giant component, i.e., the largest connected subgraph of the sample-specific metabolic network, which we define as and which includes nodes and edges.For instance, in Fig. (, one can see how the number of nodes and edges of the giant component of the sample-specific metabolic network (computed from the Recon2.2 network [29, 32]) is generally affected by the choice of distinct thresholds, regarding both cancer and normal samples. In greater detail, on the left side of Fig. (, smaller thresholds, such as , lead to a larger size of the giant component (on average), while on the right side, larger thresholds, such as , lead to a radical network reduction, with a threshold retaining 127 nodes on average, which represents approximately 1.45% of the total number of nodes of the original metabolic network. As a representative example, the shrinking of the giant component for a specific sample is visually represented in Fig. (.
Fig. (1)
Number of nodes (averaged on all samples) (a) and number of edges (averaged on all samples) (b) of the giant component of the sample-specific metabolic network (computed from the Recon2.2 network [29]), in addition to their standard deviation (error bar), defined by different threshold values either on normal and cancer samples. (A higher resolution / colour version of this figure is available in the electronic copy of the article).
Fig. (2)
The giant components of the metabolic network of the cancer sample of patient TCGA BH A0DZ obtained by projecting RNA-seq data on Recon2.2 metabolic network [29], are shown. 4 distinct giant components are shown, obtained with the following relevance thresholds: . Networks were drawn via Cytoscape [37]. (A higher resolution / colour version of this figure is available in the electronic copy of the article).
We also note that this behavior occurs similarly on both cancer and normal samples, even if the size of the giant component of the former ones tends to be slightly smaller. One may speculate that cancer subpopulations engage in a relatively lower number of metabolic functions with respect to normal cells, given that their main objective is “selfish” proliferation. Further investigations are needed to validate this interesting hypothesis [37].
Algorithmic Methods for Classification
In general, the choice of adequate network descriptors is crucial for pattern recognition purposes. Typically, the feature extraction is based on well-established network structural measures (see details in Section 2.3.1). The concurrent use of well-known measures such as degree, mean degree, clustering coefficient, mean hierarchical degree, centrality, and even spectral measurements, can identify global properties shared by a large majority of empirical and synthetic networks such as random, small-world, scale-free networks, and geographic networks models [38, 39].
Features Based on Network Structural Measures
Networks measurements falling in various categories (e.g., connectivity-related, distance-related, spectral, degree correlation measures) can be effectively used to characterize the topological properties of real-world networks [38, 40]. In our case, we are interested in determining whether certain topological measurements of the giant component of the sample-specific metabolic network obtained from RNA-seq data projection, and after opportune threshold-based pruning, can be effectively employed as features to classify cancer samples. In particular, we selected the following measures.Average Degree: Among the connectivity-related measurements, we here consider the degree (or connectivity) of node i of the giant component of sample s, given threshold , as the number of neighbors of a node defined by:.Accordingly, the average degree of the giant component is defined by Eq. (4), as follows:(4)Average Hierarchical Degree: The hierarchical degree of node i can also be measured considering the connectivity of the neighboring nodes constrained to a hierarchical level h. As an example, in social networks, the hierarchical degree of level 2 of given node i, , is the sum of the degrees of the neighbors of its neighbors. Therefore, the mean hierarchical degree of the giant component of a sample-specific metabolic network is given by Eq. (5), as follows:. (5)Average Geodesic Path Length: A path is defined as the sequence of nodes visited to go from node i to j. The distance between them is the number of edges within the path, and is defined as the geodesic path, i.e., the smallest path length. When there is no path between i and j, . The average geodesic path length of the giant component of the sample-specific metabolic network is given by:(6)where i and j are two nodes of the giant component and corresponds to a normalization factor, considering a fully connected network [40].Assortativity: The assortativity [41], i.e., the Pearson correlation coefficient of degree among all pairs of linked nodes i and j of the giant component, quantifies the tendency of the nodes of a given degree k to connect to nodes with a similar degree and, in our case, it is defined as follows:, (7)is a value within the range [−1, 1]. Values closer to 1 indicate a positive correlation (nodes with high degree tend to connect to nodes with high degree), while values closer to −1, indicate a negative correlation (nodes with a high degree tend to connect to nodes with low degree), whereas values close to 0 indicates the absence of linear dependence.In the following, we will show how to compose a feature vector by considering a set of topological measurements [35, 36, 38]. In this respect, the giant component of a sample-specific metabolic network can be characterized by a tuple containing: (i) the average degree (Eq. 4), (ii) the average hierarchical degree of level 2 (Eq. 5), (iii) the average hierarchical degree of level 3 (Eq. 5), (iv) the average geodesic path length (Eq. 6) and (v) the assortativity (Eq. 7). The vector is given by:(8)We notice that other measures such as the clustering coefficient might be employed as features. However, since in our case the input network is bipartite, there are no triangle neighborhoods and, accordingly, the clustering coefficient would always be 0. Since our framework is designed to be general, one can expect this feature to be relevant in different experimental scenarios, with distinct datasets and alternative representations of reaction graphs [42-44].
Classification Setup
Given any relevance threshold , the feature vectors are extracted for the resulting giant component of each sample s, and the classification step can be performed. The main goal of this analysis is to evaluate the classification performance of various classifiers , i.e., MLPs, SVMs and RFs on the feature vector . Furthermore, we tested the same classifiers on a reduced feature vector, including the 5 first principal components of the expression profiles of the 1673 metabolic genes present in the Recon2.2 model [29], in order to provide a comparison on the same number of features employed in our approach.In order to prevent over-optimistic results, we performed for each classifier a nested cross-validation as proposed earlier [45] and detailed as follows.The original dataset, including cancer and normal samples, is split into 5 folds, ensuring the balance between classes. 5-fold outer cross-validation is executed by using: (i) one fold as the test set to assess the model performance and (ii) 4 folds in an inner 5-fold cross-validation procedure to select the optimal hyperparameters h of the model via grid search (Table ). The whole procedure is repeated 3 times to ensure robustness to the results. The performance of all classifiers is assessed on average accuracy, precision and recall with respect to ground-truth labels.All the experiments described above were performed using the scikit-learn Python library [46].
Network Datasets
We tested our approach on the breast cancer dataset TCGA-BRCA published earlier [28]. We downloaded the dataset via the cBioPortal [47]. This dataset includes the expression profile (RNA Seq V2 RSEM) of biopsies taken from 817 patients. We selected the 105 patients for which the expression profiles of both cancer and normal tissues are provided, for a total of 210 samples used in our analysis.RNA-seq data were projected on the Recon2.2 metabolic network [29, 32] to obtain a dataset in which a Reaction Activity Score is assigned to each metabolic reaction in each sample (see above). The RASs were then normalized by dividing each reaction score by the maximum value of all samples. Finally, normalized RAS profiles are used to weigh the metabolic network as described above.SVM MLP RF
Results
RAS Threshold Analysis
A small will result in larger giant components while, in contrast, higher values of will result in smaller giant components. To choose the best classifier, we evaluated the performance obtained by the following distinct threshold values:(9)Thus, each feature vector contains the five topological measures defined above as descriptors (see Section 2.3.1).To test the discrimination power of the feature vectors in Fig. (, we computed the Kolmogorov-Smirnov statistic [48] between normal and cancer samples for each threshold and topological measure. The KS statistic (KS-test) is the distance between the cumulative probability distributions; hence the higher is the value, the more the network measures are different between normal and cancer samples.
Fig. (3)
Kolmogorov-Smirnov statistic (KS-test, [48]) between normal and cancer samples for each threshold and network topological measure: average degree assortativity average hierarchical degree of level 2 and 3 and average geodesic path length . The higher the K-S test is, the more the distribution of the network measure is different between normal and cancer samples. The highest values are obtained with and thresholds equal to and . (A higher resolution / colour version of this figure is available in the electronic copy of the article).
As a result, in our dataset, degree statistics, i.e., , and , achieve the highest D (KS-test), in particular for thresholds equal to 10−2 and 0.1. In Fig. (, we plotted the distributions of all pairs of features in , for . In accordance with the results of Fig. (, the degree statistics distributions and, in particular, and , have the sharpest difference among normal and cancer samples.
Fig. (4)
Projection of cancer and normal samples on the space of topological measure pairs and (on the diagonal) the distribution for each measure and every sample category, for a selected threshold (A higher resolution / colour version of this figure is available in the electronic copy of the article).
Classification Performance
The classification performance was assessed for all classifiers (i.e., MLPs, SVMs and RFs) on the feature vector , with regard to all relevance thresholds, via the nested cross-validation procedure described above (see Section 2.4). In addition, we employed as benchmark three analogous classifiers (i.e., MLPs, SVMs and RFs), which were provided as input with a feature vector including the 5 first principal components (PCs) of the expression profiles of the 1673 metabolic genes.In Fig. (, we report the average accuracy, precision and recall for all tested classifiers, with respect to all relevance thresholds, as well as the benchmark classifiers on gene expression PCs, by employing the ground-truth cancer sample labels (the error bars represent the standard deviation).
Fig. (5)
From left to right: average accuracy (A), average precision on cancer samples (B) and average recall on cancer samples (C) with SVMs, MLPs and RFs. The average is computed on the test sets via a repeated nested cross-validation, for three different seeds, whereas the error bars represent the standard deviation (see Section 2.4 for additional details). The best thresholds are and . (A higher resolution / colour version of this figure is available in the electronic copy of the article).
Interestingly, the best performance is achieved for all classifiers with thresholds in the small range and , and points at the existence of an effective pruning strategy to maintain the “relevant” active metabolic pathways that discriminate cancer from normal samples, while limiting the confounding effects possibly due to noisy observations and biological variability.More in detail, the best performing classifier is provided by SVMs, which reach an average accuracy of 0.86 and 0.87, a precision of 0.87 and 0.86 and a recall of 0.86 and 0.88, for and , respectively.Interestingly, such performance is extremely similar to that obtained with SVMs on the vector of gene expression PCs (average accuracy = 0.88, precision = 0.88 and recall = 0.89) and slightly superior to that of MLPs and RFs on the same vector. This result suggests that the information extracted from the few selected topological measures on the giant component of the sample-specific metabolic network is effective in discriminating cancer from normal samples, similarly to benchmark approaches processing gene expression data (5).Finally, in Fig. (, the decision boundary of the best performing SVM classifier, i.e., obtained with and optimal hyperparameters is displayed on the first two PCs of the feature vector , from which one can see that the method is able to correctly classify also the outliers of both categories.
Fig. (6)
Decision boundary of the SVM classifier with optimal hyperparameters and threshold on the full dataset. The axes correspond to the first two principal components of the full feature vector (A higher resolution / colour version of this figure is available in the electronic copy of the article).
Conclusion
In this work, we have introduced a new computational framework for the classification of cancer samples, which combines the integration of transcriptomic data and metabolic networks with state-of-the-art machine learning approaches. This task is of practical relevance in many biomedical contexts and might pave the way for the development of automated strategies for experimental hypothesis generation. In particular, the introduction of our framework contributes to the emerging field of approaches combining sample-specific metabolic modeling with machine learning to classify cancer samples and/or to predict drug response, as recently reviewed [49, 50].More in detail, we here proved that the information on the metabolic activity of single samples, derived via integration of highly accessible RNA-seq data, can be effectively used to classify healthy and pathological states, a result that appears to be robust when the original networks are significantly pruned via a relevance threshold. All in all, this result would suggest that the useful information to determine possibly aberrant states in a given sample can be derived from the high-level (topological) properties of a relatively limited number of active processes. The identification and characterization of such processes deserve further investigation.Regarding our machine learning approach, we here relied on classical topological measures, such as degree, hierarchical degrees, average geodesic path length and assortativity, to encode the structural information of the metabolic network. Additional experiments may employ recent graph representation learning techniques [51, 52], including graph kernels [53] and convolutional neural networks on graphs [54], to automatically extract a low-dimensional feature vector of the input network.We finally remark that extensions of the framework are currently ongoing to test its applicability to more complex scenarios, involving, for instance, multiclass and multi-label classification with respect to cancer subtypes and risk categories.
Table 1
Hyperparameters grid search for the tested classifiers, i.e., MLPs, SVMs and RFs, executed via the scikit-learn Python library. Parameter names are the sklearn arguments of the related functions (default was used for the other parameters).
Authors: Alessandro Filisetti; Alex Graudenzi; Roberto Serra; Marco Villani; Rudolf M Füchslin; Norman Packard; Stuart A Kauffman; Irene Poli Journal: Theory Biosci Date: 2011-10-07 Impact factor: 1.919
Authors: Giulio Caravagna; Alex Graudenzi; Daniele Ramazzotti; Rebeca Sanz-Pamplona; Luca De Sano; Giancarlo Mauri; Victor Moreno; Marco Antoniotti; Bud Mishra Journal: Proc Natl Acad Sci U S A Date: 2016-06-28 Impact factor: 11.205