Pediatric Sarcoma Data Forms a Unique Cluster Measured via the Earth Mover's Distance.

Yongxin Chen¹, Filemon Dela Cruz², Romeil Sandhu³, Andrew L Kung², Prabhjot Mundi⁴, Joseph O Deasy¹, Allen Tannenbaum⁵.

Abstract

In this note, we combined pediatric sarcoma data from Columbia University with adult sarcoma data collected from TCGA, in order to see if one can automatically discern a unique pediatric cluster in the combined data set. Using a novel clustering pipeline based on optimal transport theory, this turned out to be the case. The overall methodology may find uses for the classification of data from other biological networking problems.

Year: 2017 PMID: 28765612 PMCID: PMC5539155 DOI: 10.1038/s41598-017-07551-8

Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379

Introduction

The present note describes a novel method for data clustering applied to the classification of pediatric sarcoma data. Specifically, we combined two data sets: the first consists of the gene expression of predominantly pediatric sarcoma patients, and the second consists of the gene expression of adult sarcoma patients taken from The Cancer Genome Atlas (TCGA) database. We then wanted to see if one could discern some quantifiable difference between the pediatric and adult cases. Accordingly, we applied a method based on the $L^1$ Earth Mover’s Distance (EMD) to the data; see the sections below for all the details. In what follows, EMD will always refer to the $L^1$ version of the Earth Mover’s Distance. Briefly, the proposed pipeline constructs a weighted graph based on the network topology inferred from the Human Protein Reference Database (HPRD); then, treating the graph as a Markov chain, it constructs the invariant (stationary) measure, computes the pairwise distances via EMD among all the networks, and represents the resulting distance matrix as a heat map. (We note that a heat map is a graphical representation of data in which the individual values contained in a matrix are represented as colors.) Other than one outlier (discussed below), our method was able to segregate the pediatric cases from the adult cases, i.e., we found two rather distinct clusters. We should note that ideas based on the $L^1$ Earth Mover’s Distance (also known as the Wasserstein 1-metric[1–3]) have already been applied in studying various properties of cancer networks. In particular, the Wasserstein 1-metric leads to a notion of curvature[4] that turns out to be positively correlated with network robustness[5–7]. This geometric network approach to studying cancer led to some work indicating that cancer networks are more functionally robust than their normal counterparts[6, 7]. 
The EMD (and more generally optimal mass transport theory) is very natural for studying the properties of various weighted graphs modeling biological networks, since it gives a natural metric between probability distributions. Its use has become widespread in recent years, with applications to problems in communications, finance, engineering, and biology[1–3, 8]. This work continues this line of research by using the distance to cluster biological data. Finally, we believe that the overall pipeline can be applied more generally to cluster many different types of network data (represented as weighted graphs). We note that we associate the invariant measure to each individual network in the class of data to be classified, and then apply the EMD. This is a distinct advantage, since no preprocessing is necessary other than normalizing the weighted graphs to ensure that they define a Markov process[9].

Results

Data

The gene expression data sets used in the present work consist of two parts. The first part includes the gene expression of 27 patients diagnosed with pediatric-associated sarcoma and treated at Columbia University Medical Center (CUMC). Informed and signed consent for clinical and research sequencing was obtained in the context of the pediatric precision medicine program (PIPseq) established at CUMC and under the CUMC Institutional Review Board (IRB)-approved protocol AAAN8404[10]. The second part was downloaded from The Cancer Genome Atlas (TCGA) database, covering the gene expression data of 265 adult patients. We have one sample per patient in both parts, for 292 samples in total. The data sets were normalized using a standard method for RNA-Seq count data, namely the variance-stabilizing transformation (VST) in the DESeq2 package for R[11]. This normalization was performed jointly over all 292 sarcoma samples. The network topology (graph adjacency matrix) was constructed using interaction information from the Human Protein Reference Database (HPRD)[12]. Specifically, we took the intersection of the genes that appear in both the HPRD data and the gene expression data, and then kept the largest connected component. After discarding the redundant genes, we arrived at a gene regulatory network with 8844 nodes (genes) and 34926 edges (interactions). The average and median degrees are 7.9 and 4, respectively.
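The network construction step described above (intersect the gene lists, then keep the largest connected component) can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function name, the toy gene names, and the toy edge list are hypothetical and are not taken from HPRD or from the authors' code.

```python
from collections import deque
from statistics import mean, median

def largest_connected_component(genes, interactions):
    """Return {gene: set of neighbors} for the largest connected component.

    genes: gene names present in the expression data (hypothetical input).
    interactions: (gene_a, gene_b) pairs from an interaction database.
    """
    genes = set(genes)
    adj = {g: set() for g in genes}
    for a, b in interactions:
        # keep only interactions with both endpoints measured; drop self-loops
        if a in genes and b in genes and a != b:
            adj[a].add(b)
            adj[b].add(a)
    best, seen = set(), set()
    for start in adj:
        if start in seen:
            continue
        # breadth-first search collects the component containing `start`
        comp, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in comp:
                    comp.add(nb)
                    queue.append(nb)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return {g: adj[g] & best for g in best}

# toy data: "X" is not in the expression data, and E-F is a small separate component
toy_genes = ["A", "B", "C", "D", "E", "F"]
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("E", "F"), ("A", "X")]
net = largest_connected_component(toy_genes, toy_edges)
degrees = [len(nbrs) for nbrs in net.values()]
n_edges = sum(degrees) // 2  # each undirected edge is counted twice
print(sorted(net), n_edges, mean(degrees), median(degrees))
```

As a consistency check on the reported figures, the average degree of an undirected graph is 2m/n, and 2 × 34926 / 8844 ≈ 7.9, in agreement with the value stated above.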

Weighted graph and invariant measure

We constructed a weighted graph for each sample using the mass action principle[13]. In particular, for given gene expression levels $\{x_i > 0 \mid 1 \le i \le n\}$, the weight $p_{ij}$ on the edge $(i, j)$ is defined as
$$p_{ij} = \frac{x_j}{\sum_{k \in N(i)} x_k} \tag{1}$$
for any $j \in N(i)$. Here $n = 8844$ is the number of nodes and $N(i)$ denotes the set of neighbors of node $i$. Note that by construction the matrix $P = [p_{ij}]$ is a stochastic matrix and satisfies $p_{ij} = 0$ if the edge $(i, j)$ doesn’t exist. Biologically speaking, mass action is similar to well-established methods of differential gene co-expression[14] utilized to develop specific profiles for metastatic states[13]. Here, similar to differential co-expression or correlation for analyzing co-regulation patterns in cellular pathways, mass action is based on the assumption that the intensity of the interaction between two interacting genes is likely to be larger if both of them have higher expression levels. For example, often in drug studies[15, 16], one studies co-regulation patterns via the differential expression of genes that are induced through a knockdown of a separate gene (e.g., PI3K inhibition by BYL719 induces expression of estrogen receptor function in breast cancer[15]). Here, the same underlying principles are used when employing mass action, with the added advantage that one can construct patient/sample-specific networks without the multiple samples needed for correlation. The stochastic matrix $P$ defines a Markov chain[9] on the gene regulatory network. Different properties such as entropy and curvature have been considered for this object to study the robustness of cancer networks[6, 7, 17]. Here we consider the invariant measure (stationary distribution) of this Markov chain. The Markov chain describes the information flow between genes. When the underlying network is connected, the system will eventually reach an equilibrium and this equilibrium is described by the invariant measure. 
Mathematically, it is a probability vector $\pi$ satisfying
$$\pi^T P = \pi^T. \tag{2}$$
Thus $\pi$ is a left eigenvector of $P$ with non-negative entries that sum to 1. The value $\pi_i$ at node $i$ reflects the portion of contribution of that node to the entire network. In other words, the invariant measure $\pi$ is a centrality measure of the significance of different genes. In general, to obtain the invariant measure, one needs to solve the linear equation (2). However, for the specific stochastic matrix in (1), $\pi$ has the explicit structure
$$\pi_i = \frac{1}{Z}\, x_i \sum_{j \in N(i)} x_j, \tag{3}$$
where $Z$ is a normalization factor (partition function) forcing $\pi$ to be a probability vector. The expression (3) is very interesting. Note that the value of $\pi$ at node $i$ reflects the significance of gene $i$ in the gene regulatory network. It consists of two components: the gene expression level $x_i$ of gene $i$ and the total gene expression of its neighbors, $\sum_{j \in N(i)} x_j$. In other words, the invariant measure captures the key property that a gene is important if its expression level is high and it interacts with many other genes.
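To make the construction concrete, the following minimal sketch builds the mass-action transition matrix, with row i proportional to the expression levels of the neighbors of i, and checks numerically that the closed-form vector proportional to x_i times the summed expression of its neighbors is stationary. The four-node toy graph, the expression values, and the function names are hypothetical choices made for illustration only.

```python
def transition_matrix(x, neighbors):
    """Mass-action weights: P[i][j] = x[j] / sum(x[k] for k in N(i)), zero off the graph."""
    n = len(x)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        total = sum(x[k] for k in neighbors[i])
        for j in neighbors[i]:
            P[i][j] = x[j] / total
    return P

def invariant_measure(x, neighbors):
    """Closed form: pi[i] proportional to x[i] * sum(x[j] for j in N(i))."""
    raw = [x[i] * sum(x[j] for j in neighbors[i]) for i in range(len(x))]
    Z = sum(raw)  # normalization factor
    return [r / Z for r in raw]

# toy network (hypothetical): node 1 is a hub connected to nodes 0, 2 and 3
neighbors = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
x = [2.0, 1.0, 4.0, 3.0]  # toy expression levels, all positive

P = transition_matrix(x, neighbors)
pi = invariant_measure(x, neighbors)

# left-multiply: (pi P)[j] = sum_i pi[i] * P[i][j]; stationarity means pi P == pi
pi_P = [sum(pi[i] * P[i][j] for i in range(len(x))) for j in range(len(x))]
residual = max(abs(a - b) for a, b in zip(pi, pi_P))
row_sums = [sum(row) for row in P]
print(pi, residual)
```

In this toy example the hub node receives half of the stationary mass even though its own expression level is the lowest, which illustrates how the measure combines a gene's own expression with that of its neighbors.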

Optimal transport on graphs

Consider a connected undirected graph $G = (V, E)$ with $n$ nodes in $V$ and $m$ edges in $E$. Given two densities $\rho_0, \rho_1 \in \mathbb{R}^n$ on the graph, the original formulation of the optimal transport problem seeks a joint distribution $\rho \in \mathbb{R}^{n \times n}$ of $\rho_0$ and $\rho_1$ (that is, with marginals $\rho_0$ and $\rho_1$) minimizing the total cost $\sum_{i,j} c_{ij}\rho_{ij}$, that is,
$$W_1(\rho_0, \rho_1) = \min_{\rho \ge 0} \Big\{ \sum_{i,j=1}^{n} c_{ij}\,\rho_{ij} \;:\; \sum_{k} \rho_{ik} = \rho_0(i),\ \sum_{k} \rho_{kj} = \rho_1(j),\ \forall\, i, j \Big\}. \tag{4}$$
Here $c_{ij}$ is the cost of moving unit mass from node $i$ to node $j$ and is taken to be the minimum number of steps needed to go from $i$ to $j$; namely, $c$ is the ground metric on the graph. For example, if the edge $(i, j)$ exists, then $c_{ij} = 1$. The minimum of this optimization problem defines a metric $W_1$ (the Earth Mover’s Distance) on the space of probability densities on $G$. An alternative formulation (see Methods) is defined on the fluxes given on the edges. Let $D \in \mathbb{R}^{n \times m}$ be the oriented incidence matrix of $G$; then
$$W_1(\rho_0, \rho_1) = \min_{u \in \mathbb{R}^m} \big\{ \|u\|_1 \;:\; \rho_0 - \rho_1 + D u = 0 \big\}. \tag{5}$$
Note that the incidence matrix is defined by associating an orientation to each edge $e = (i, j) = (j, i)$ of the graph: one of the nodes $i, j$ is defined to be the head and the other the tail, and then we set $d_{ie} = +1$ ($-1$) if $i$ is the head (tail) of $e$ and 0 otherwise. Compared to (4), which has $n^2$ variables, the above formulation has only $m$ variables. It may greatly reduce the computational load when the graph is sparse, i.e., $m \ll n^2$. This is the case in our data sets, where $n = 8844$ and $m = 34926$. In implementation, we used the standard convex optimization package CVX[18], written in Matlab, to solve (5) numerically. We should also note that there is some very nice recent work on the fast computation of the Earth Mover’s Distance[19] based on (5).
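The flux formulation can be illustrated on small graphs without a convex solver. Splitting each edge flux into its two directions turns the problem into an uncapacitated minimum-cost flow with unit cost per edge, and the sketch below solves that with a textbook successive-shortest-path scheme. This is not the CVX-based implementation used in this work; it is a pure-Python stand-in, suitable only for tiny examples, and the function name and toy graphs are hypothetical.

```python
def graph_emd(n, edges, rho0, rho1, tol=1e-12):
    """W1 between rho0 and rho1 on a connected, undirected graph with unit edge costs.

    Returns the total absolute flux of a minimum-cost flow moving rho0 onto rho1.
    """
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    excess = [rho0[i] - rho1[i] for i in range(n)]  # mass still to be shipped out of i
    flow = {}  # flow[(a, b)] > 0 means mass currently routed from a to b

    def cost(a, b):
        # cancelling existing flow b -> a refunds one unit of cost
        return -1 if flow.get((b, a), 0.0) > tol else 1

    while True:
        sources = [i for i in range(n) if excess[i] > tol]
        sinks = [i for i in range(n) if excess[i] < -tol]
        if not sources or not sinks:
            break
        s = sources[0]
        # Bellman-Ford from s on the residual graph (arc costs may be negative)
        dist = [float("inf")] * n
        pred = [None] * n
        dist[s] = 0
        for _ in range(n - 1):
            for a in range(n):
                if dist[a] == float("inf"):
                    continue
                for b in adj[a]:
                    if dist[a] + cost(a, b) < dist[b]:
                        dist[b] = dist[a] + cost(a, b)
                        pred[b] = a
        t = min(sinks, key=lambda i: dist[i])  # nearest node with a deficit
        path, node = [], t
        while node != s:
            path.append((pred[node], node))
            node = pred[node]
        # amount to push: limited by the excess, the deficit, and cancellable flow
        delta = min(excess[s], -excess[t])
        for a, b in path:
            if cost(a, b) == -1:
                delta = min(delta, flow[(b, a)])
        for a, b in path:
            if cost(a, b) == -1:
                flow[(b, a)] -= delta
                if flow[(b, a)] <= tol:
                    del flow[(b, a)]
            else:
                flow[(a, b)] = flow.get((a, b), 0.0) + delta
        excess[s] -= delta
        excess[t] += delta
    return sum(flow.values())

# path graph 0 - 1 - 2: all mass moves two steps
d_path = graph_emd(3, [(0, 1), (1, 2)], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0])
# 4-cycle 0 - 1 - 2 - 3 - 0: each half of the mass moves one step
d_cycle = graph_emd(4, [(0, 1), (1, 2), (2, 3), (3, 0)],
                    [0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5])
print(d_path, d_cycle)
```

Only the m edge fluxes are stored here, which mirrors the variable count of the flux formulation; for networks of the size considered in this work, a dedicated solver would be the practical choice.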

Clustering of sarcoma data

We define a distance function between different gene expression data sets using optimal transport theory on graphs. More specifically, we define the distance between two gene expression data sets to be the $W_1$ optimal mass transport distance between the two invariant measures induced by the gene expressions as in (3). This distance $W_1$ can be computed through convex programming[1, 8]. We computed the $W_1$ distances between each pair of the 292 samples (27 pediatric sarcoma and 265 adult sarcoma). The heat map of the resulting distance matrix is shown in Fig. 1. The samples clearly split into two clusters: one cluster for the 27 pediatric sarcoma samples and one cluster for the 265 adult patients.
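The pairwise step can be sketched as a symmetric matrix filled from any distance function on measures. In the hedged example below, the distance is the W1 distance on a simple path graph, where it reduces to the sum of absolute cumulative differences; this toy distance, the three toy measures, and the function names are assumptions made purely for illustration and stand in for the graph-based EMD on the full network.

```python
from itertools import accumulate, combinations

def path_w1(p, q):
    """W1 on a path graph with unit edge lengths: sum of |cumulative difference|."""
    return sum(abs(c) for c in accumulate(a - b for a, b in zip(p, q)))

def pairwise_distances(measures, dist):
    """Symmetric distance matrix; each unordered pair is evaluated once."""
    n = len(measures)
    D = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        D[i][j] = D[j][i] = dist(measures[i], measures[j])
    return D

# three toy probability vectors on a 4-node path (hypothetical data)
measures = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.25, 0.25, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
D = pairwise_distances(measures, path_w1)

# number of distinct pairs among the 292 samples considered in this work
n_pairs = 292 * 291 // 2
print(D, n_pairs)
```

Since the distance is symmetric, only the 42486 unordered pairs among 292 samples need to be evaluated, and these evaluations are independent of one another.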

Figure 1

Heat Map Showing Pediatric Cluster.

To visualize the two clusters more clearly, we truncate the distances using a threshold: the value is set to zero if the distance is less than the given threshold and to one otherwise. The results with threshold values 0.075 and 0.1 are depicted in Figs 2 and 3, respectively. Note that there is a small gap between these two clusters, which indicates that the last sample in the pediatric sarcoma set is an outlier. Figure 4 is a 3D plot of the distance matrix, from which we can see an obvious difference that distinguishes this outlier from the rest of the sarcoma samples. The clusters and the outlier can also be seen in the histograms. Figures 5, 6 and 7 are the histograms of the distances within the pediatric sarcomas, within the adult sarcomas, and between these two age groups, respectively. Evidently, the distances within the two groups (pediatric, adult) are smaller than the distances between them. In particular, the average distances within the two groups are 0.0891 and 0.0665, respectively, while the average distance between them is 0.1366. The distance between the outlier and the other samples is shown in Fig. 8, with mean value 0.2424, which is significantly larger than the average. See our discussion in the next section for further analysis of these results.
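The thresholding and the within-group and between-group averages described above can be sketched as follows. The 4-by-4 distance matrix, the labels, and the function names are invented for illustration; they are not the values from this study.

```python
from itertools import combinations
from statistics import mean

def threshold(D, tau):
    """Binary matrix: 0 where the distance is below tau, 1 otherwise."""
    return [[0 if d < tau else 1 for d in row] for row in D]

def group_means(D, labels, a, b):
    """Mean distance within group a, within group b, and between the two groups."""
    idx_a = [i for i, lab in enumerate(labels) if lab == a]
    idx_b = [i for i, lab in enumerate(labels) if lab == b]
    within_a = mean(D[i][j] for i, j in combinations(idx_a, 2))
    within_b = mean(D[i][j] for i, j in combinations(idx_b, 2))
    between = mean(D[i][j] for i in idx_a for j in idx_b)
    return within_a, within_b, between

# toy symmetric distance matrix (hypothetical values)
labels = ["pediatric", "pediatric", "adult", "adult"]
D = [
    [0.00, 0.05, 0.14, 0.13],
    [0.05, 0.00, 0.12, 0.15],
    [0.14, 0.12, 0.00, 0.06],
    [0.13, 0.15, 0.06, 0.00],
]
B = threshold(D, 0.1)
within_ped, within_adult, between = group_means(D, labels, "pediatric", "adult")
print(B, within_ped, within_adult, between)
```

With a well-chosen threshold the binary matrix shows zero blocks on the diagonal for each group and ones elsewhere, which is the block pattern that the thresholded heat maps make visible.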

Figure 2

Heat Map Showing Pediatric Cluster with Threshold Value 0.075.

Figure 3

Heat Map Showing Pediatric Cluster with Threshold Value 0.1.

Figure 4

3D Plot Showing Pediatric Cluster.

Figure 5

Distances within the Pediatric Cluster.

Figure 6

Distances within the Adult Cluster.

Figure 7

Distances between Pediatric Cluster and Adult Cluster.

Figure 8

Distances between the Outlier (PIP13-81192) and the other Samples.

Discussion

Sarcomas represent a heterogeneous group of malignant solid tumors of connective tissue. Sarcomas comprise approximately 1.5% of all malignant tumors diagnosed in adults and over 7% of cancers in children[20]. Although the diversity of sarcoma subtypes can be encountered across the age spectrum, the distribution of sarcoma subtypes differs significantly between adults and children. For example, osteosarcoma and Ewing sarcoma (malignant bone tumors) are predominant in children and young adults, whereas undifferentiated pleomorphic sarcoma (previously called malignant fibrous histiocytoma), liposarcoma and leiomyosarcoma are extremely rare in children[21, 22]. In addition to the observation that particular sarcoma subtypes predominate in either childhood or adulthood, there are also differences in the clinical outcomes of adult and childhood sarcoma patients that extend beyond the differences in treatment regimens between adult and childhood sarcomas[21, 23–25]. With the emergence of next-generation sequencing technologies, we are afforded the opportunity to evaluate the biologic differences between pediatric-associated and adult-associated sarcomas. In our analysis of 27 sarcoma cases treated at CUMC, only 26 of the 27 original cases would be categorized as a pediatric-associated sarcoma. Interestingly, one case originally included in the pediatric set segregated as an outlier. This case represents a 25-year-old female with a history of multiply relapsed, metastatic alveolar soft part sarcoma (ASPS). ASPS is a rare sarcoma subtype comprising 0.2–0.9% of all soft tissue sarcomas[26]. ASPS is extremely rare in childhood, and is more commonly diagnosed in adolescence and young adulthood (15–35 years of age)[20]. A second adult case included in the pediatric cohort is from a 38-year-old male with metastatic synovial sarcoma. In contrast to the previous adult case of ASPS, this case segregated with the pediatric cohort. 
Synovial sarcoma is a soft tissue sarcoma with a peak incidence in the 3rd decade of life, and with about 1/3 of cases occurring within the pediatric age range[27]. Synovial sarcoma is more common than ASPS and is the most frequent non-rhabdomyosarcomatous soft tissue sarcoma in adolescents and young adults[28]. Although historical differences in the approach to therapy between pediatric and adult oncologists have existed for the treatment of sarcomas and other tumors, there has been acknowledgement in the adult oncology community of the clinical utility of pediatric-based regimens for the treatment of sarcomas occurring in adulthood[29, 30]. However, despite the use of more dose-intense chemotherapeutic approaches to the treatment of sarcomas in adulthood, pediatric-associated sarcomas diagnosed and treated in adulthood continue to have inferior outcomes compared to those treated in childhood[31, 32]. These observations suggest that there may exist age-dependent differences in the biology of sarcomas. However, it is unclear what age thresholds would contribute to differential responses to treatment and clinical outcome, as the cutoffs for age and the definition of “adult age” have varied in the literature. The results from this analysis suggest that the sarcoma subtype may supersede, in this instance, the contribution of age to the biologic behavior and genomic signature. Thus, this classification scheme suggests that there are indeed biologic differences between sarcoma subtypes that are generally associated with childhood (such as synovial sarcoma) and those more commonly associated with adulthood (such as ASPS), which provides a rationale for the use of pediatric regimens for the treatment of these diseases regardless of the patient’s age. 
Genomic characterization of a larger cohort of pediatric-associated and adult-associated sarcomas will be imperative in specifically clarifying the genomic lesions that result in the clinical differences in behavior of sarcomas across the age spectrum. In any event, we did manage to separate 26 out of the 27 CUMC cases from the TCGA data using our methodology. In our Supplemental Information file, we have two other examples. The first illustrates the methodology applied to clustering breast cancer data (triple negative and normal). The second, using synthetic “gene expression” networks, shows the importance of topology in clustering. EMD has the nice feature of explicitly utilizing the topology of the network under consideration. We should finally note that the pipeline sketched in Fig. 9 is quite general and may be useful in clustering various biological networks. These may typically be represented as weighted graphs, and thus, after suitable normalization, as Markov chains for which the corresponding stationary measures exist. Optimal mass transport theory, realized by the Earth Mover’s Distance, seems to be an ideal tool for capturing distances among these measures, and thus leads to a natural clustering/classification framework. Several interesting biological graphs, as suggested by one of the reviewers, could include those based on evolutionary distance between genes, structural similarity within the same fold family, percent of shared functional sites (even predicted ones), and percent of shared protein domains.

Figure 9

Overall Sketch of Method.

Methods

Overall sketch

Figure 9 illustrates the overall pipeline of the clustering methodology described in the previous sections. The basic idea is that once one has defined the network topology (in this case via the Human Protein Reference Database) and the weights connecting the nodes (derived here from the mass action principle), one can compute in a straightforward manner the invariant measure of each network, and then compute the distance matrix defined by the EMD or Wasserstein 1-metric. In the next section, we review the definition and properties of this central mathematical object underpinning our analysis.

Earth Mover’s Distance

In this section, we briefly review the mathematics of the Earth Mover’s Distance (EMD) from optimal mass transport theory, the key method on which all the previous results were based. The classical Earth Mover’s Distance was formulated by Monge in 1781 to solve the problem of moving a pile of soil to an excavation site with the least amount of work relative to some cost. This is illustrated in Fig. 10. For full details as well as long lists of references, see the monographs[1–3].

Figure 10

Classical Earth Mover’s Problem: The dashed arrow indicates the transport map between the densities $\rho_0$ and $\rho_1$.

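Mathematically, we let $\rho_0$ and $\rho_1$ denote two probability densities on $\mathbb{R}^N$. This means that $\rho_i : \mathbb{R}^N \to \mathbb{R}$ with $\rho_i \ge 0$ for $i = 0, 1$, such that
$$\int_{\mathbb{R}^N} \rho_0(x)\,dx = \int_{\mathbb{R}^N} \rho_1(x)\,dx = 1.$$
Then the Earth Mover’s Distance (also called the Wasserstein 1-metric, $W_1$) between them is
$$W_1(\rho_0, \rho_1) = \inf_{\pi \in \Pi(\rho_0, \rho_1)} \int_{\mathbb{R}^N \times \mathbb{R}^N} \|x - y\|\,\pi(dx, dy), \tag{6}$$
where $\Pi(\rho_0, \rho_1)$ denotes the set of couplings between $\rho_0$ and $\rho_1$. The Wasserstein-1 distance has the dual formulation[8]
$$W_1(\rho_0, \rho_1) = \sup_{f} \Big\{ \int_{\mathbb{R}^N} f(x)\,\big(\rho_0(x) - \rho_1(x)\big)\,dx \;:\; \|f\|_{\mathrm{Lip}} \le 1 \Big\}. \tag{7}$$
Here $\|f\|_{\mathrm{Lip}}$ denotes the Lipschitz constant of $f$. Clearly, when $f$ is differentiable, $\|f\|_{\mathrm{Lip}} \le 1$ is equivalent to $\|\nabla f\| \le 1$. So formally, the above can be rewritten as
$$W_1(\rho_0, \rho_1) = \sup_{f} \Big\{ \int_{\mathbb{R}^N} f(x)\,\big(\rho_0(x) - \rho_1(x)\big)\,dx \;:\; \|\nabla f\| \le 1 \Big\}. \tag{8}$$
One can then take the dual once again, i.e., starting from (8), one sees that
$$W_1(\rho_0, \rho_1) = \inf_{u} \Big\{ \int_{\mathbb{R}^N} \|u(x)\|\,dx \;:\; \rho_0 - \rho_1 + \nabla \cdot u = 0 \Big\}, \tag{9}$$
which is a formulation of $W_1$ with the flux $u$ being the optimization variable. The above “dual of the dual” method can be applied to transport problems on graphs to show the equivalence between (4) and (5) by replacing the ground metric $\|x - y\|$ by $c_{ij}$ and the divergence operator by $D$. In so doing, one gets a tremendous saving in computational burden, since equation (4) involves a number of variables on the order of the square of the number of nodes, while equation (5) is of the order of the number of edges. In our specific case, we had 8844 nodes (genes) and 34926 edges (interactions); that is, the number of variables drops from $n^2 \approx 7.8 \times 10^7$ to $m \approx 3.5 \times 10^4$, a saving of more than three orders of magnitude.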
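The passage between the gradient-constrained dual and the flux formulation can be sketched formally with a Lagrange multiplier $f$ for the constraint $\rho_0 - \rho_1 + \nabla \cdot u = 0$. The computation below is only a formal sketch: it assumes that $u$ decays fast enough for the boundary terms in the integration by parts to vanish, and it establishes only the weak-duality inequality, with equality requiring a separate argument.

```latex
\begin{aligned}
\inf_{u}\Big\{\int \|u\|\,dx \;:\; \rho_0-\rho_1+\nabla\cdot u = 0\Big\}
 &= \inf_{u}\,\sup_{f}\;\int \|u\|\,dx + \int f\,\big(\rho_0-\rho_1+\nabla\cdot u\big)\,dx \\
 &= \inf_{u}\,\sup_{f}\;\int f\,(\rho_0-\rho_1)\,dx + \int \big(\|u\| - \nabla f\cdot u\big)\,dx \\
 &\ge \sup_{f}\;\Big[\int f\,(\rho_0-\rho_1)\,dx + \inf_{u}\int \big(\|u\| - \nabla f\cdot u\big)\,dx\Big] \\
 &= \sup_{f}\Big\{\int f\,(\rho_0-\rho_1)\,dx \;:\; \|\nabla f\|\le 1\Big\}.
\end{aligned}
```

The last step uses the Cauchy–Schwarz inequality: the inner infimum over $u$ equals $0$ when $\|\nabla f\| \le 1$ pointwise and $-\infty$ otherwise, so the multiplier is effectively restricted to functions with gradient norm at most one.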

References

1. PI3K inhibition results in enhanced estrogen receptor function and dependence in hormone receptor-positive breast cancer.

Authors: Ana Bosch; Zhiqiang Li; Anna Bergamaschi; Haley Ellis; Eneda Toska; Aleix Prat; Jessica J Tao; Daniel E Spratt; Nerissa T Viola-Villegas; Pau Castel; Gerard Minuesa; Natasha Morse; Jordi Rodón; Yasir Ibrahim; Javier Cortes; Jose Perez-Garcia; Patricia Galvan; Judit Grueso; Marta Guzman; John A Katzenellenbogen; Michael Kharas; Jason S Lewis; Maura Dickler; Violeta Serra; Neal Rosen; Sarat Chandarlapaty; Maurizio Scaltriti; José Baselga
Journal: Sci Transl Med Date: 2015-04-15 Impact factor: 17.956

2. Alveolar Soft Part Sarcoma.

Authors: Omar I Jaber; Patricia A Kirby
Journal: Arch Pathol Lab Med Date: 2015-11 Impact factor: 5.534

3. Signalling entropy: A novel network-theoretical framework for systems analysis and interpretation of functional omic data.

Authors: Andrew E Teschendorff; Peter Sollich; Reimer Kuehn
Journal: Methods Date: 2014-03-24 Impact factor: 3.608

4. National survival trends of young adults with sarcoma: lack of progress is associated with lack of clinical trial participation.

Authors: Archie Bleyer; Michael Montello; Troy Budd; Scott Saxman
Journal: Cancer Date: 2005-05-01 Impact factor: 6.860

5. IGF-1R and mTOR Blockade: Novel Resistance Mechanisms and Synergistic Drug Combinations for Ewing Sarcoma.

Authors: Salah-Eddine Lamhamedi-Cherradi; Brian A Menegaz; Vandhana Ramamoorthy; Deeksha Vishwamitra; Ying Wang; Rebecca L Maywald; Adriana S Buford; Izabela Fokt; Stanislaw Skora; Jing Wang; Aung Naing; Alexander J Lazar; Eric M Rohren; Najat C Daw; Vivek Subbiah; Robert S Benjamin; Ravin Ratan; Waldemar Priebe; Antonios G Mikos; Hesham M Amin; Joseph A Ludwig
Journal: J Natl Cancer Inst Date: 2016-08-30 Impact factor: 13.506

6. Synovial sarcoma: a retrospective analysis of 271 patients of all ages treated at a single institution.

Authors: Andrea Ferrari; Alessandro Gronchi; Michela Casanova; Cristina Meazza; Lorenza Gandola; Paola Collini; Laura Lozza; Rossella Bertulli; Patrizia Olmi; Paolo G Casali
Journal: Cancer Date: 2004-08-01 Impact factor: 6.860

7. Role of chemotherapy in pediatric nonrhabdomyosarcoma soft-tissue sarcomas.

Authors: Andrea Ferrari
Journal: Expert Rev Anticancer Ther Date: 2008-06 Impact factor: 4.512

8. Chemotherapy is associated with improved survival in adult patients with primary extremity synovial sarcoma.

Authors: Fritz C Eilber; Murray F Brennan; Frederick R Eilber; Jeffery J Eckardt; Stephen R Grobmyer; Elyn Riedel; Charles Forscher; Robert G Maki; Samuel Singer
Journal: Ann Surg Date: 2007-07 Impact factor: 12.969

9. Implementation of next generation sequencing into pediatric hematology-oncology practice: moving beyond actionable alterations.

Authors: Jennifer A Oberg; Julia L Glade Bender; Maria Luisa Sulis; Danielle Pendrick; Anthony N Sireci; Susan J Hsiao; Andrew T Turk; Filemon S Dela Cruz; Hanina Hibshoosh; Helen Remotti; Rebecca J Zylber; Jiuhong Pang; Daniel Diolaiti; Carrie Koval; Stuart J Andrews; James H Garvin; Darrell J Yamashiro; Wendy K Chung; Stephen G Emerson; Peter L Nagy; Mahesh M Mansukhani; Andrew L Kung
Journal: Genome Med Date: 2016-12-23 Impact factor: 11.117

10. Differential network entropy reveals cancer system hallmarks.

Authors: James West; Ginestra Bianconi; Simone Severini; Andrew E Teschendorff
Journal: Sci Rep Date: 2012-11-13 Impact factor: 4.379
