Literature DB >> 23468608

Biomolecular events in cancer revealed by attractor metagenes.

Wei-Yi Cheng1, Tai-Hsien Ou Yang, Dimitris Anastassiou.   

Abstract

Mining gene expression profiles has proven valuable for identifying signatures serving as surrogates of cancer phenotypes. However, the similarities of such signatures across different cancer types have not been strong enough to conclude that they represent a universal biological mechanism shared among multiple cancer types. Here we present a computational method for generating signatures using an iterative process that converges to one of several precise attractors defining signatures representing biomolecular events, such as cell transdifferentiation or the presence of an amplicon. By analyzing rich gene expression datasets from different cancer types, we identified several such biomolecular events, some of which are universally present in all tested cancer types in nearly identical form. Although the method is unsupervised, we show that it often leads to attractors with strong phenotypic associations. We present several such multi-cancer attractors, focusing on three that are prominent and sharply defined in all cases: a mesenchymal transition attractor strongly associated with tumor stage, a mitotic chromosomal instability attractor strongly associated with tumor grade, and a lymphocyte-specific attractor.

Entities:  

Mesh:

Year:  2013        PMID: 23468608      PMCID: PMC3581797          DOI: 10.1371/journal.pcbi.1002920

Source DB:  PubMed          Journal:  PLoS Comput Biol        ISSN: 1553-734X            Impact factor:   4.475


Introduction

Despite their type-specific features, cancers share some common traits, or “hallmarks,” related to, e.g., the abilities of some cancer cells to divide uncontrollably or to invade surrounding tissues [1]. Furthermore, it has been recognized that gene expression signatures resulting from analysis of cancer datasets can serve as surrogates of cancer phenotypes [2]. Therefore, it is reasonable to hypothesize that computational analysis of rich biomolecular cancer datasets may reveal signatures that are shared across many cancer types and are associated with specific cancer phenotypes. Such rich datasets become publicly available at an increasing rate from many sources, such as The Cancer Genome Atlas (TCGA). However, attempts to identify any such robust “bioinformatic hallmarks” of cancer have so far been largely unsuccessful. Gene signatures may occasionally be found to exhibit similarities across multiple cancer types. However, to our knowledge no algorithm has ever produced a set of nearly identical signatures after independently and separately analyzing datasets from different cancer types. There are various ways by which modules of co-expressed genes can be identified from rich datasets, some of which may be within the context of regulatory network discovery [3]. Clustering approaches can classify a selected set of genes into subsets each of which contains mutually related genes. Related techniques can also be used to classify samples into cancer subtypes [4]–[6], each characterized by a set of characteristic genes. One of the most powerful computational approaches for this task has been nonnegative matrix factorization (NMF) [7] combined with consensus clustering [8], resulting in successful subtype identification in several types of cancer. The main objective addressed by techniques such as NMF is to reduce dimensionality by identifying a number of metagenes jointly representing the gene expression dataset as accurately as possible, in lieu of the whole set of individual genes. Each metagene in NMF is defined as a positive linear combination of the individual genes, so that its expression level is an accordingly weighted average of the expression levels of the individual genes. The identity of each resulting metagene is influenced by the presence of other metagenes within the objective of overall dimensionality reduction achieved by joint optimization. By contrast, if the aim is exclusively to identify metagenes as surrogates of biomolecular events, then a fully unconstrained algorithm should be devised, without any effort to achieve dimensionality reduction, classification, mutual exclusivity, orthogonality, regulatory interaction inference, etc. We can consider, for example, a hypothetical case in which we have found a cluster consisting of a number of co-expressed genes in a rich gene expression dataset. We may wish to scrutinize and “sharpen” this co-expression trying to identify the “heart” (core) of the genes that are most strongly co-expressed in that case. In the absence of a defining phenotype, we can continue applying an unsupervised methodology, as follows: We can first define a consensus metagene from the average expression levels of all genes in the cluster, and rank all the individual genes in terms of their association (defined numerically by some form of correlation) with that metagene. We can then replace the member genes of the cluster with an equal number of the top-ranked genes. Some of the original genes may naturally remain as members of the cluster, but some may be replaced, as this process will “attract” some other genes that are more strongly correlated with the cluster. We can now define a new metagene defined by the average expression levels of the genes in the newly defined cluster, and re-rank all the individual genes in terms of their association with that new metagene; and so on. It is intuitively reasonable to expect that this iterative process will eventually converge to a cluster that contains precisely the genes that are most associated with the metagene of the same cluster, so that any other individual genes will be less strongly associated with the metagene. We can think of this particular cluster defined by the convergence of this iterative process as an “attractor,” i.e., a module of co-expressed genes to which many other gene sets with close but not identical membership will converge using the same computational methodology. The above description represents a simplified conceptual introduction of the computational methodology presented in this paper. Rather than using the average of the expression values in gene clusters of a particular size, the “attractors” are metagenes defined as weighted averages of all genes where each individual gene has a nonnegative weight, just like the metagenes defined using NMF [7]. We found that, given a rich (loosely defined as containing at least 200 samples) dataset represented by a gene expression matrix, such metagenes can be naturally identified as stable and precise attractors using a simple iterative approach. We experimentally verified these nice convergence properties without any exception after trying numerous times the method described in this paper on such rich datasets. This methodology is totally unsupervised, as it does not make use of any phenotypic association. As we show in this paper, however, once identified, a metagene attractor is likely to be found associated with a phenotype. We found that several attractor metagenes are present in nearly identical form in multiple cancer types. This provides an additional opportunity to combine the powers of a large number of rich datasets to focus, at an even sharper level, on the core genes of the underlying mechanism. For example, this methodology can precisely point to the causal (driver) oncogenes within amplicons to be among very few candidate genes. Importantly, this can be done from rich gene expression data, which already exist in abundance, without making any use of sequencing data. We identified several attractors, each of which has the potential to lead to corresponding testable biological hypotheses after scrutinizing their top-ranked genes and finding a putative underlying mechanism. For the purposes of this paper we present the general methodology for the benefit of the research community together with a listing of the attractors in six datasets from three cancer types (ovarian, colon, breast). Here, we focus on a few interesting cancer-associated attractors that we found present in multiple cancer types. Particular emphasis is given to what we consider to be three key “bioinformatic hallmarks” of cancer, related to the ability of cancer cells to invade surrounding tissues; to divide uncontrollably; and the ability of the organism to recruit the immune system to fight cancer: a tumor stage-associated mesenchymal transition attractor, a tumor grade-associated mitotic chromosomal instability (mitotic CIN) attractor, and a lymphocyte-specific attractor.

Results

Derivation of Attractor Metagenes

Given a nonnegative measure of pairwise association between genes and , we define an attractor metagene to be a linear combination of the individual genes with weights . The association measure is assumed to have minimum possible value 0 and maximum possible value 1, so the same is true for the weights. It is also assumed to be scale-invariant, therefore it is not necessary for the weights to be normalized so that they add to 1, and the metagenes can still be thought of as expressing a normalized weighted average of the expression levels of the individual genes. See Materials and Methods for the choice of the measure . According to this definition, the genes with the highest weights in an attractor metagene will have the highest association with the metagene (and, by implication, they will tend to be highly associated among themselves) and so they will often represent a biomolecular event reflected by the co-expression of these top genes. This can happen, e.g., when a biological mechanism is activated, or when a copy number variation (CNV), such as an amplicon, is present, in some of the samples included in the expression matrix. In the following we use the term “attractor” for simplicity to refer to an attractor metagene, and the term “top genes” to refer to the genes with the highest weights in the attractor. The definition of an attractor metagene can readily be generalized to include features other than gene expression, such as methylation values. It can also be used in datasets of any objects (not necessarily genes) characterized by any type of feature vectors, with applications in other disciplines, such as social and economic sciences. The computational problem of identifying attractor metagenes given an expression matrix can be addressed heuristically using a simple iterative process: Starting from a particular seed (or “attractee”) metagene , a new metagene is defined in which the new weights are . The same process is then repeated in the next iteration resulting in a new set of weights, and so forth. In all gene expression datasets that we tried we found that this process converges to a limited number of stable attractors. Each attractor is defined by a precise set of weights, which are reached with high accuracy typically within 10 or 20 iterations. This algorithmic behavior with nice convergence properties is not surprising, because if a metagene represents co-expressed genes, then the next iteration will naturally “attract” other similarly co-expressed genes, and so forth, until there are no other genes more associated with the top genes than those genes themselves. Furthermore, the set of the few genes with the highest weight are likely to represent the “heart” (core) of the underlying biomolecular event. In support of this concept, the association of any of the top-ranked individual genes with the attractor metagene is consistently and significantly higher than the pairwise association between any of these genes, suggesting that the set of these top genes jointly comprise a proxy representing a biomolecular event better than each of the individual genes would. Indeed, related versions of the signatures identified by attractors in this paper have been previously identified in various contexts in individual cancer types, often intermingled with additional genes. However, the contribution of our work is that these signatures are found as pan-cancer biomolecular events, sharply pointing to the underlying mechanism. Therefore the top genes of the attractors will be appropriate for being used as biomarkers or for understanding the underlying biology. For example, one of the attractors that we identified (the “mitotic chromosomal instability” attractor, described below) has previously been found in approximate forms among sets of genes described generally [9] as “proliferation” or “cell cycle related” markers, while the actual attractor points much more sharply to particular elements in the structure of the kinetochore-microtubule interface. A reasonable implementation of an “exhaustive” search of attractor metagenes is to start from each individual gene as a seed (“attractee”) assigning a weight of 1 to that gene, and 0 to all the other genes. Each gene participating in a particular co-expression event will then lead to the same attractor when used as attractee. The computational implementation of the algorithm is described in Materials and Methods. We note that a dual method can be used to identify attractor “metasamples” as representatives of subtypes, and we can also combine such metasamples with the attractor metagenes in various ways to achieve biclustering, but this topic is not examined in this paper. We analyzed six datasets, two from ovarian cancer, two from breast cancer and two from colon cancer (Supplementary Text S1). In each case, we identified general (Supplementary Table S1) and genomically localized (Supplementary Table S2) attractors and we found that many among them appear in similar forms in all six datasets using particular merging and ranking criteria in each case (Materials and Methods and Supplementary Text S1). Following are descriptions of some of our results, starting with the three strongest multi-cancer attractors.

Mesenchymal Transition Attractor Metagene

This attractor contains mostly epithelial-mesenchymal transition (EMT)-associated genes. Table 1 provides a listing of the top 100 genes based on their average mutual information (Materials and Methods) with their corresponding attractor metagenes.
Table 1

Top 100 genes of the mesenchymal transition attractor based on six datasets.

RankGene SymbolAvg MIRankGene SymbolAvg MI
1COL5A20.81451SULF10.505
2VCAN0.77552LOXL10.502
3SPARC0.76653PRRX10.502
4THBS20.75854PPAPDC1A0.499
5FBN10.74955COL10A10.498
6COL1A20.74956ITGA110.495
7COL5A10.74757NTM0.494
8FAP0.73458MXRA80.494
9AEBP10.71159FIBIN0.493
10CTSK0.70960WISP10.483
11COL3A10.68861RCN30.483
12COL1A10.68362TNFAIP60.481
13SERPINF10.67463ECM20.480
14COL6A30.66964HTRA10.480
15CDH110.66365EFEMP20.478
16GLT8D20.65866MXRA50.474
17LUM0.65467ACTA20.472
18MMP20.65468LOX0.470
19DCN0.65069ITGBL10.466
20CCDC800.63770PMP220.465
21POSTN0.63171P4HA30.464
22CTHRC10.61672PTRF0.463
23ADAM120.61373CALD10.460
24COL6A20.60874HEG10.458
25MSRB30.60875NEXN0.455
26OLFML2B0.60776NID20.455
27INHBA0.60077TAGLN0.455
28FSTL10.60078FAM26E0.452
29SFRP20.59679ZNF5210.452
30SNAI20.57780SFRP40.451
31CRISPLD20.57481PALLD0.450
32PCOLCE0.57182OLFML10.447
33PDGFRB0.56783FILIP1L0.447
34BGN0.56584TIMP30.445
35COL12A10.56085SPON20.443
36ANGPTL20.55586SPOCK10.443
37COPZ20.55387COL8A20.441
38CMTM30.54988GPC60.438
39ASPN0.54789PDPN0.437
40FN10.54590GFPT20.436
41CNRIP10.54091LHFP0.436
42FNDC10.53892GREM10.436
43LRRC150.53393TGFB1I10.435
44COL11A10.52994C1S0.433
45ANTXR10.52895EDNRA0.432
46RAB310.52796GAS10.431
47FRMD60.52497NOX40.431
48TSHZ30.52098FBLN20.428
49THY10.51999TCF40.428
50NNMT0.519100NUAK10.427
The consistency of the attractor is established by the fact (Supplementary Table S1) that there are many genes (COL5A2, COL1A2, SPARC, CTSK, FBN1, VCAN, AEBP1, SERPINF1) that are among the top 50 genes in the attractors of all six datasets. The corresponding P value is less than 10−7 by permutation test (Materials and Methods). Similar results are found in other solid cancer types in all cases that we tried. This is a stage-associated attractor, in which the signature is significantly present only when a particular level of invasive stage, specific to each cancer type, has been reached. Supplementary Table S3 demonstrates this phenomenon in three cancer datasets from different types (breast, ovarian and colon) that were annotated with clinical staging information, by providing a listing of differentially expressed genes, ranked by fold change, when ductal carcinoma in situ (DCIS) progresses to invasive ductal carcinoma; ovarian cancer progresses to stage III; and colon cancer progresses to stage II. In all three cases, the attractor is highly enriched among the top genes. Specifically, among the top 100 differentially expressed genes, the number of attractor genes included in Table 1 is 47 in breast cancer, 42 in ovarian cancer and 37 in colon cancer. The corresponding P values are 2×10−93, 4×10−80 and 8×10−78, respectively. This attractor has been previously identified with remarkable accuracy as representing a particular kind of mesenchymal transition of cancer cells present in all types of solid cancers tested leading to a published list of top 64 genes [10], [11]. This list was generated using a supervised algorithm using association with tumor stage. Indeed 52 of these top 64 genes also appear in Table 1 (P<10−114), and furthermore all top 19 genes of Table 1 are among the 64. We found that most of the genes of the signature were expressed by the cancer cells themselves, and not by the surrounding stroma, at least in a neuroblastoma xenograft model that we tried [11]. We also found that the signature is associated with prolonged time to recurrence in glioblastoma [12]. Related versions of the same signature were previously found to be associated with resistance to neoadjuvant therapy in breast cancer [13]. These results are consistent with the finding that EMT induces cancer cells to acquire stem cell properties [14]. It has been hypothesized that EMT is a key mechanism for cancer cell invasiveness and motility [15]–[17]. The attractor, however, appears to represent a more general phenomenon of transdifferentiation present even in nonepithelial cancers such as neuroblastoma, glioblastoma and Ewing's sarcoma. Although similar signatures are often labeled as “stromal,” because they contain many stromal markers such as α-SMA and fibroblast activation protein, the fact that most of the genes of the signature were expressed by xenografted cancer cells [11], and not by mouse stromal cells, suggests that this particular attractor of coordinately expressed genes represents cancer cells having undergone a mesenchymal transition. The signature may indicate a non-fibroblastic transition, as occurs in glioblastoma, in which case collagen COL11A1 is not co-expressed with the other genes of the attractor. We have hypothesized that a full fibroblastic transition of the cancer cells occurs when cancer cells encounter adipocytes [11], in which case they may well assume the duties of cancer-associated fibroblasts (CAFs) in some tumors [1]. In that case, the best proxy of the signature [10] is COL11A1 and the strongly co-expressed genes THBS2 and INHBA. Indeed, the 64 genes of the previously identified signature were found from multi-cancer analysis [10] as the genes whose expression is consistently most associated with that of COL11A1. The only EMT-inducing transcription factor found upregulated in the xenograft model [11] is SNAI2 (Slug), and it is also the one most associated with the signature in publicly available datasets. We also found that the microRNAs most highly associated with this attractor are miR-214, miR-199a, and miR-199b. Interestingly, miR-214 and miR-199a were found to be jointly regulated by another EMT-inducing transcription factor, TWIST1 [18]. The expression of the mesenchymal transition attractor indicates that the tumor is actively invasive at the specific sample site, so its prognostic value is cancer type and stage specific. As an example, we analyzed an oral squamous cell carcinoma dataset deposited in the Gene Expression Omnibus (GEO) under accession number GSE25104. The corresponding Kaplan-Meier survival curve (P = 0.0066) is shown in Figure 1.
Figure 1

Kaplan-Meier curves of mesenchymal transition attractor metagene in oral squamous cell carcinoma dataset.

Gene expression data from 57 patients (GSE25104) were divided into two groups: high mesenchymal transition metagene level and low mesenchymal transition metagene level depending on whether the metagene expression value exceeding the mean of the 57 patients. The P value of the association was determined by log-rank test.

Kaplan-Meier curves of mesenchymal transition attractor metagene in oral squamous cell carcinoma dataset.

Gene expression data from 57 patients (GSE25104) were divided into two groups: high mesenchymal transition metagene level and low mesenchymal transition metagene level depending on whether the metagene expression value exceeding the mean of the 57 patients. The P value of the association was determined by log-rank test.

Mitotic CIN Attractor Metagene

This attractor contains mostly kinetochore-associated genes. Table 2 provides a listing of the top 100 genes based on their average mutual information (Materials and Methods) with their corresponding attractor metagenes, starting from CENPA, which encodes for a histone H3-like centromeric protein.
Table 2

Top 100 genes of the mitotic CIN attractor based on six datasets.

RankGene SymbolAvg MIRankGene SymbolAvg MI
1CENPA0.72051CDCA80.532
2DLGAP50.69352CDC450.528
3MELK0.68453KIF18A0.524
4BUB10.67454HMMR0.506
5KIF2C0.66055TOP2A0.505
6KIF20A0.65856CENPF0.503
7KIF4A0.65657ZWINT0.503
8CCNA20.65458PLK10.501
9CCNB20.65259RAD51AP10.501
10NCAPG0.64960FAM83D0.498
11TTK0.64261E2F80.497
12CEP550.63862CENPE0.497
13CCNB10.63263MKI670.492
14CDK10.62964CENPN0.491
15HJURP0.62665MAD2L10.489
16CDC200.62466CHEK10.486
17CDCA50.61567GTSE10.477
18NCAPH0.61568RAD510.475
19BUB1B0.60969SGOL20.474
20KIF230.59270PARPBP0.469
21KIF110.59171TRIP130.467
22BIRC50.58972SHCBP10.465
23NUF20.58773DTL0.465
24TPX20.58674CENPL0.462
25AURKB0.58275FEN10.461
26RACGAP10.58076FANCI0.461
27NUSAP10.58077FBXO50.459
28ASPM0.57978ECT20.457
29MCM100.57979MND10.456
30PRC10.57680CDC25C0.456
31DEPDC1B0.57281PBK0.456
32UBE2C0.56982KPNA20.452
33UBE2T0.56783RAD54L0.452
34NEK20.56684ESPL10.447
35FOXM10.56585CDCA20.446
36NDC800.55686FAM64A0.440
37CDCA30.55687CENPK0.436
38FAM54A0.55388MYBL20.435
39ANLN0.55189SPAG50.434
40KIF150.54890EZH20.431
41STIL0.54791SMC40.430
42EXO10.54292TACC30.428
43AURKA0.54093C11orf820.427
44PTTG10.53994MASTL0.426
45OIP50.53995ASF1B0.426
46RRM20.53996PTTG3P0.425
47DEPDC10.53997CENPW0.424
48CDKN30.53898ORC10.424
49KIF140.53799HELLS0.422
50SPC250.534100TK10.421
The consistency of the attractor is established by the fact (Supplementary Table S1) that there are many genes (CENPA, DLGAP5, KIF2C, CCNB2, MELK, CCNA2, KIF20A, HJURP, NUSAP1, BUB1, TTK, KIF11, NCAPH) that are among the top 50 genes in the attractors of all six datasets. The corresponding P value is less than 10−7 by permutation test (Materials and Methods). Similar results are found in other solid cancer types in all cases that we tried. Contrary to the stage-associated mesenchymal transition attractor, this is a grade-associated attractor, in which the signature is significantly present only when an intermediate level of tumor grade is reached. Supplementary Table S4 demonstrates this phenomenon in three cancer datasets from different types (breast, ovarian and bladder) that were annotated with tumor grade information, by providing a listing of differentially expressed genes, ranked by fold change, when grade G2 is reached. In all three cases, the attractor is highly enriched among the top genes. Specifically, among the top 100 differentially expressed genes, the number of attractor genes included Table 2 is 40 in breast cancer, 38 in ovarian cancer and 27 in colon cancer. The corresponding P values are 4×10−74, 3×10−69 and 3×10−49, respectively. Consistently, a similar “gene expression grade index” signature [19] was previously found differentially expressed between histologic grade 3 and histologic grade 1 breast cancer samples. Furthermore, that same signature [19] was found capable of reclassifying patients with histologic grade 2 tumors into two groups with high versus low risks of recurrence. This attractor is associated with chromosomal instability (CIN), as evidenced from the fact that another similar gene set comprising a “signature of chromosomal instability” [20] was previously derived from multiple cancer datasets purely by identifying the genes that are most correlated with a measure of aneuploidy in tumor samples. This led to a 70-gene signature referred to as “CIN70.” Indeed 31 of these 70 genes appear in Table 2 (P<10−53). However, several top genes of the attractor, such as CENPA, DLGAP5, KIF2C, BUB1 and CCNA2 are not present in the CIN70 list. Mitotic CIN is increasingly recognized [21] as a widespread multi-cancer phenomenon. The attractor is characterized by overexpression of kinetochore-associated genes, which is known [22] to induce CIN. Overexpression of several of the genes of the attractor, such as the top gene CENPA [23], as well as MAD2L1 [24] and TPX2 [25], has also been independently previously found associated with CIN. Included in the mitotic CIN attractor are key components of mitotic checkpoint signaling [26], such as BUB1B, MAD2L1 (aka MAD2), CDC20, and TTK (aka MSP1). Also among the genes in the attractor is MKI67 (aka Ki-67), which has been widely used as a proliferation rate marker in cancer. Among transcription factors, we found MYBL2 (aka B-Myb) and FOXM1 to be strongly associated with the attractor. They are already known to be sequentially recruited to promote late cell cycle gene expression [27] to prepare for mitosis. Inactivation of the retinoblastoma (RB) tumor suppressor promotes CIN [28] and the expression of the attractor signature. Indeed, a similar expression of a “proliferation gene cluster [29]” was found strongly associated with the human papillomavirus E7 oncogene, which abrogates RB protein function and activates E2F-regulated genes. Consistently, many among the genes of the attractor correspond to E2F pathway genes controlling cell division or proliferation. Among the E2F transcription factors, we found that E2F8 and E2F7 are most strongly associated with the attractor.

Lymphocyte-Specific Attractor Metagene

This attractor consists mainly of lymphocyte-specific genes with prominent presence of CD53, PTPRC, LAPTM5, DOCK2, LCP2 and IL10RA. It is strongly associated with the expression of microRNA miR-142 as well as with particular hypermethylated and hypomethylated gene signatures [30]. There is also significant overlap between the sets of hypomethylated and overexpressed genes, suggesting that their expression is triggered by hypomethylation. Gene set enrichment analysis reveals that the attractor is found enriched in genes known to be preferentially expressed in differentiation into lymphocytes [31]. Table 3 provides a listing of the top 100 genes of the lymphocyte-specific attractor based on their average mutual information (Materials and Methods) with their corresponding attractor metagenes.
Table 3

Top 100 genes of the lymphocyte-specific attractor based on six datasets.

RankGene SymbolAvg MIRankGene SymbolAvg MI
1PTPRC0.78251NCF10.560
2CD530.76852CCL50.557
3LCP20.73953LST10.557
4LAPTM50.70854CD3D0.553
5DOCK20.69955RCSD10.548
6IL10RA0.69956FGL20.538
7CYBB0.69857HCST0.538
8CD480.69158MARCH10.538
9ITGB20.67959FERMT30.536
10EVI2B0.67560FCGR2B0.533
11MS4A6A0.67361GIMAP50.530
12TFEC0.65962MYO1F0.530
13SLA0.65763KLHL60.530
14TNFSF13B0.65764GIMAP10.527
15C1orf1620.65665CD1630.524
16SAMSN10.65266CLEC7A0.522
17PLEK0.64967CCR10.518
18GMFG0.64768GBP50.517
19GIMAP40.64769NCF20.516
20SASH30.64570HLA-DPA10.516
21EVI2A0.63871RNASE60.515
22SRGN0.63872CD140.515
23AIF10.63673FAM26F0.511
24LAIR10.62774CD40.510
25FYB0.62575FCGR1A0.506
26FCER1G0.62376GZMA0.506
27MPEG10.62177GPR1830.505
28CD860.62178CD840.505
29C3AR10.61179NKG70.504
30C1QB0.60880C1QA0.502
31CD20.60681CD300LF0.500
32HCLS10.59982FPR30.499
33HCK0.59283PARVG0.496
34MNDA0.58784TRAF3IP30.494
35CD370.58785TYROBP0.492
36LY960.58586LPXN0.492
37CCR50.58587GIMAP80.492
38ARHGAP90.58088MS4A70.490
39CD520.58089IL2RB0.489
40GPR650.58090CD300A0.488
41GIMAP60.57891IGSF60.488
42SLAMF80.57792SELPLG0.488
43WIPF10.57793FCGR2A0.487
44MS4A4A0.57494NCKAP1L0.483
45ARHGAP150.57395DOK20.483
46HAVCR20.56796CD2470.481
47ARHGAP300.56697SELL0.480
48CLEC4A0.56698GZMK0.479
49TAGAP0.56499CCR20.479
50CYTIP0.563100LY860.479
The gene membership of the attractor provides hints about the underlying immune mechanism, which could be valuable towards generating hypotheses for potential immunotherapies such as adoptive transfer of lymphocytes. For example, the presence of the signal-transducing LCP2 (aka SLP-76) gene, together with the adaptor FYB (aka ADAP), suggests the formation of the SLP-76-ADAP adaptor module, which is known to regulate lymphocyte co-stimulation mediated by integrin ITGB2 (aka LFA-1), another prominent gene in the attractor [32].

Association of the Three Main Attractor Metagenes with Prognosis in Breast Cancer

We found that each of the above three main attractors under particular conditions is highly prognostic in breast cancer by analysing the METABRIC discovery breast cancer dataset [33] which includes both expression as well as survival data.

Mesenchymal transition attractor

In breast cancer, the mesenchymal transition attractor is expressed very early, as cancer becomes invasive. The presence of the attractor in a particular sample of high-stage tumor in not as informative, because of heterogeneity. On the other hand we found that the presence of the attractor in early-stage tumors is highly prognostic, consistent with the hypothesis that it indicates increased invasiveness. As shown in Figure 2, the Kaplan-Meier 15-year survival curves of the mesenchymal transition attractor using all samples does not show any significance. However, when we restrict the samples to early stage patients, defined as having no positive lymph nodes and tumor size less than 30 mm, the association between the attractor and survival became significant (P = 0.032).
Figure 2

Kaplan-Meier curve of mesenchymal transition attractor metagenes in breast cancer dataset.

The mesenchymal transition attractor metagene is most prominent in the early stage of breast cancer. The survival curve of the full dataset is insignificant (left). However, when the samples are restricted to only those at early stage (with no positive lymph nodes and tumor size less than 30 mm), the association between the mesenchymal transition attractor and the survival becomes significant (right), with P = 0.032.

Kaplan-Meier curve of mesenchymal transition attractor metagenes in breast cancer dataset.

The mesenchymal transition attractor metagene is most prominent in the early stage of breast cancer. The survival curve of the full dataset is insignificant (left). However, when the samples are restricted to only those at early stage (with no positive lymph nodes and tumor size less than 30 mm), the association between the mesenchymal transition attractor and the survival becomes significant (right), with P = 0.032.

Mitotic CIN attractor

The expression of the mitotic CIN attractor indicates that the tumor is dividing uncontrollably and therefore, in all cases, the attractor is highly prognostic for survival. The corresponding Kaplan-Meier 15-year survival curve (P<2×10−8) is shown in Figure 3. Furthermore, we ranked all the genes in terms of the concordance index [34] between the survival and the individual gene's expression value from the same rich dataset. Table 4 shows the top 100 genes, within which 47 (indicated by underline and boldface), including the top six, are also among the genes shown in Table 2 (P = 2×10−98 by Fisher's exact test). This extraordinary enrichment (also note that eight among the top ten genes, including the top three, are among the genes of Table 2) demonstrates that the members of the mitotic CIN attractor are, among all genes, the most prognostic ones, at least in breast cancer.
Figure 3

Kaplan-Meier curve of mitotic CIN attractor metagene in breast cancer dataset.

To evaluate the association between the mitotic CIN metagene expression and the 15-year survival, patients were divided into two groups: high mitotic CIN and low mitotic CIN. This binary expression level was determined by whether the mitotic CIN metagene expression value exceeding the mean of the patients. The P value of the association based on log-rank test is 1.8×10−8.

Table 4

List of top-ranked genes in terms of survival concordance index of the METABRIC discovery dataset demonstrating enrichment of the mitotic CIN attractor.

RankGene SymbolConcordance IndexRankGene SymbolConcordance Index
1 CDCA5 0.67051PRR110.639
2 AURKA 0.66352LOC6518160.638
3 KIF20A 0.66253KRT800.638
4TROAP0.66154C15orf420.637
5 UBE2C 0.65955SGOL10.637
6 AURKA 0.65856GPI0.637
7 FAM83D 0.65757 CEP55 0.637
8SHMT20.65558 MCM10 0.636
9 UBE2C 0.65559PKMYT10.635
10 CCNB2 0.65360 CENPN 0.635
11 TPX2 0.65361C20orf240.635
12 EXO1 0.65362SPC240.635
13ORC60.65363RIPK40.635
14 CENPA 0.65364TOMM400.634
15C1orf1060.65265 ANLN 0.634
16 GTSE1 0.65266ADRM10.634
17 MELK 0.65167 NCAPG 0.633
18STIP10.65168 CDCA8 0.633
19 SPC25 0.64969AIF1L0.633
20 CENPA 0.64970MRPS50.633
21GARS0.64971GPR560.633
22 MELK 0.64972PEX130.633
23UCK20.64873ENO10.633
24 HJURP 0.64874NUTF20.633
25 PTTG1 0.64775MEMO10.632
26CBX20.64676TXNRD10.632
27CCNE10.64677SLC7A50.631
28 PLK1 0.64678 FOXM1 0.631
29 KIF2C 0.64579 KIF14 0.631
30 CCNA2 0.64580PPP1R14B0.631
31GMPSP10.64581 FAM54A 0.630
32 AURKB 0.64582C20orf240.630
33 BUB1 0.64483SGOL10.630
34 TRIP13 0.64384NUP930.630
35 FOXM1 0.64385ZNF6950.630
36 CDC20 0.64386 BIRC5 0.630
37LOC7310490.64287 CENPL 0.630
38POLQ0.64288SOX110.630
39GSK3B0.64289 KIF23 0.629
40CCNE10.64290SLC52A20.629
41 KIF4A 0.64191AIF1L0.629
42 PRC1 0.64192 FEN1 0.629
43LAD10.64193CDC25A0.629
44 FAM64A 0.64194 CDCA3 0.628
45SAPCD20.64195TMEM132A0.628
46 RACGAP1 0.64196 CENPE 0.628
47POLR2D0.64197NACC20.628
48CKAP2L0.64098 TTK 0.628
49 PTTG1 0.64099SNRPA10.628
50ECE20.639100MMP150.628

The 47 underlined genes are also among the top 100 genes of the mitotic CIN attractor (Table 2).

Kaplan-Meier curve of mitotic CIN attractor metagene in breast cancer dataset.

To evaluate the association between the mitotic CIN metagene expression and the 15-year survival, patients were divided into two groups: high mitotic CIN and low mitotic CIN. This binary expression level was determined by whether the mitotic CIN metagene expression value exceeding the mean of the patients. The P value of the association based on log-rank test is 1.8×10−8. The 47 underlined genes are also among the top 100 genes of the mitotic CIN attractor (Table 2).

Lymphocyte-specific attractor

We found the attractor to be strongly protective in ER-negative breast cancers. As shown in Figure 4, the Kaplan-Meier 15-year survival curve restricted to ER-negative reveals association with longer survival (P = 0.004). Although the precise underlying biological mechanisms are unclear, it appears that this effect is due to some type of immune system recruitment to fight cancer. Interestingly, however, this effect is reversed in the presence of multiple positive lymph nodes. Indeed, the corresponding Kaplan-Meier curve shown on the right side of Figure 4 when restricted to patients with more than five positive lymph nodes demonstrates that the presence of the signature is associated with bad prognosis. This reversal may be explained by the fact that the presence of the lymphocyte-specific signature when lymph nodes are already affected implies that the cancer has obtained a (devastating for the patient) tolerance to this type of immune system recruitment.
Figure 4

Kaplan-Meier curve of lymphocyte-specific attractor metagenes in breast cancer dataset.

For ER-negative patients, the expression of the attractor is highly protective (high expression implies longer survival, left). However, when multiple lymph nodes are already affected, the expression of the attractor has a reversed effect on survival. When we restrict the samples to those with more than five positive lymph nodes, higher expression of the lymphocyte-specific attractor implies shorter survival (right), although the association is not significant due to the limited number of samples (76).

Kaplan-Meier curve of lymphocyte-specific attractor metagenes in breast cancer dataset.

For ER-negative patients, the expression of the attractor is highly protective (high expression implies longer survival, left). However, when multiple lymph nodes are already affected, the expression of the attractor has a reversed effect on survival. When we restrict the samples to those with more than five positive lymph nodes, higher expression of the lymphocyte-specific attractor implies shorter survival (right), although the association is not significant due to the limited number of samples (76).

Other Attractors

Chr8q24.3 amplicon attractor

Amplification in chr8q24 is often considered to be associated with cancer because of the presence of the MYC (aka c-Myc) oncogene at location 8q24.21. Indeed, MYC is one of 157 genes in “amplicon 8q23-q24” previously identified [35] in an extensive study of the breast cancer “amplicome” derived from 191 samples. We found, however, that the core of the amplified genes occurs at location 8q24.3 and this is, in fact, our most prominent multi-cancer amplicon attractor. Core genes of the attractor are PUF60 (aka FIR), EXOSC4, SHARPIN, HSF1, BOP1, SLC52A2. It is known that PUF60 can repress c-Myc via its far upstream element (FUSE), although a particular isoform was found to have the opposite effect [36]. The other genes may also play important roles. For example, HSF1 (heat shock transcription factor 1) has been associated with cancer in various ways [37]. It was found [38] that HSF1 can induce genomic instability through direct interaction with CDC20, a key gene of the mitotic CIN attractor mentioned above (listed in Table 2). Furthermore, HSF1 was found [39] required for the cell transformation and tumorigenesis induced by the ERBB2 (aka HER2) oncogene (see subsequent discussion of HER2 amplicon) responsible for aggressive breast tumors. The top ten genes of the chr8q24.3 attractor, ranked by the average of the highest five values of mutual information (Materials and Methods), are shown in Table 5. Interestingly, as shown in one of the attractors in Supplementary Table S1, an aneuploidy attractor involving a whole arm amplification of chr8q is also occasionally present in multiple cancer types, and this 8q whole arm amplification is the most prominent such aneuploidy attractor among all chromosomes.
Table 5

List of top ten genes in the chr8q24.3 and HER2 amplicons.

chr8q24.3 HER2
RankGene SymbolAvg MI of Top 4 DatasetsRankGene SymbolAvg MI of Top 4 Datasets
1EXOSC40.7161PGAP30.794
2PUF600.6592ERBB20.793
3BOP10.6533STARD30.768
4SLC52A20.6394MIEN10.764
5SHARPIN0.6345GRB70.718
6HSF10.6166PSMD30.602
7FBXL60.6157GSDMB0.539
8CYC10.6088ORMDL30.498
9SCRIB0.5529MED240.414
10GPAA10.55110MED10.400
Furthermore, prognostic associations involving the 8q24.3 amplicon have recently been recognized in various cancers [40], [41].

Chr17q12 HER2 amplicon attractor

This amplicon is prominent in breast cancer [42] and we also found it present in some samples of ovarian cancer, but not as much in colon cancer. So we initially used the four datasets of breast and ovarian cancer for deriving the attractor. We found that ERBB2 (aka HER2), STAR3, GRB7 and PGAP3 were the top-ranked genes, consistent with their known presence in the amplicon. We also found that gene MIEN1 (aka C17orf37) was very highly ranked in the two datasets in which its probe set was present. MIEN1 has recently been identified as an important gene within the 17q12 amplicon in various cancers including prostate cancer [43]. Therefore, we augmented the choice of datasets to the following seven, of which MIEN1 is included in five: breast GSE2034, breast GSE32646, breast GSE36771, breast TCGA, ovarian GSE9891, ovarian GSE26193, ovarian TCGA. Table 5 shows the top ten genes ranked by the average of the top five scores of mutual information (Materials and Methods) in the seven datasets for each gene. The results suggest that the above-mentioned five genes, including MIEN1, are consistently strongly co-expressed, and therefore are likely “driver” genes in the amplicon. In addition to the narrow HER2 amplicon, it is known that sometimes a large amplicon extends to more than a million bases containing both HER2 as well as TOP2A (one of the genes of the mitotic CIN attractor) at 17q21 [44]. We have observed that TOP2A indeed appears among the top 50 genes in terms of its association with the attractor in breast cancer. HER2/TOP2A co-amplification has been linked with better clinical response to therapy.

Estrogen receptor breast cancer attractor

We found this attractor clearly present only in breast cancer, and therefore we derived it using six breast cancer datasets (GSE2034, GSE3494, GSE31448, GSE32646, GSE36771, breast TCGA). Table 6 shows the top 50 genes ranked by the average mutual information (Materials and Methods) in these datasets, revealing that genes AGR3, CA12, AGR2, GATA3, FOXA1, MLPH and TBC1D9 are strongly co-expressed with the estrogen receptor ESR1 in the attractor. However, this co-expression is not as uniform as in the other attractors. For example, the progesterone receptor PGR appear in the list, but in reality it can be underexpressed even if the estrogen receptor ESR1 is not.
Table 6

Top 50 genes of the estrogen receptor breast cancer attractor.

RankGene SymbolAvg MIRankGene SymbolAvg MI
1AGR30.84726ERBB40.393
2CA120.61627AR0.383
3FOXA10.61328P4HTM0.383
4GATA30.58529SLC44A40.380
5MLPH0.58030KDM4B0.375
6AGR20.57031GFRA10.374
7ESR10.54332MAPT0.370
8TBC1D90.54033MYB0.364
9XBP10.46034DACH10.359
10ANXA90.45635SLC7A80.359
11PRR150.45236MAGED20.358
12SCUBE20.44437FBP10.357
13FSIP10.43838SLC22A50.355
14TFF30.42939CMBL0.346
15SPDEF0.42940DYNLRB20.346
16NAT10.42841C6orf2110.342
17ABAT0.42342GREB10.342
18CCDC1700.42243SIDT10.338
19DNALI10.41844TTC39A0.330
20DEGS20.41545FAM214A0.326
21DNAJC120.41146IL6ST0.324
22SLC39A60.40647CXXC50.323
23CAPN80.39948ACADSB0.323
24TFF10.39749CELSR10.322
25THSD40.39550CLSTN20.322

Comparison with Other Unsupervised Algorithms

The scope of the algorithm identifying attractor metagenes is different from that of other unsupervised methods, which are usually aimed at identifying subtypes or mutually exclusive clusters. Nevertheless, it is interesting to find to what extent other algorithms can produce multiple cancer signatures each of which appears in nearly identical form across different types. We applied three widely used methods, k-means clustering, principal component analysis and hierarchical clustering on the six cancer datasets used in this paper. In all cases, we listed the top fifty genes in each cluster and applied the same clustering algorithm as in the main text to find common genes among them and group them together. The results are shown in Supplementary Text S2 and Supplementary Tables S5, S6, S7. We found that, in all cases, these well-established methods cannot identify multiple universal metagenes common in all six tested datasets.

Using Attractor Metagenes as Proxies of Biomolecular Events

A biomolecular event, whether it is present in multiple cancer types or it is cancer specific, can be represented by a “consensus attractor metagene” after analyzing multiple datasets. To generate such consensus attractors, we use genes that were profiled by at least three of the six datasets, then rank individual genes in terms of their average mutual information (Materials and Methods) with the corresponding attractor metagenes across all datasets in which the gene was profiled. For example, Figure 5 contains scatter plots from four different rich breast cancer datasets connecting the mitotic CIN and estrogen receptor attractors. It has previously been reported [45] that breast tumors with high chromosomal instability are predominantly of the estrogen receptor negative phenotype. Although these scatter plots cannot be used for precise conclusions, they do suggest in all cases that ER-negative tumors have high mitotic chromosomal instability (or equivalently that low chromosomal instability implies that the tumor is ER-positive). The reverse relationship, however, is not as clear.
Figure 5

Scatter plots demonstrating the relationship between mitotic CIN attractor and estrogen receptor attractor in breast cancer.

The two metagenes were defined to be “consensus attractors” after ranking individual genes in terms of their average mutual information with the corresponding attractor metagenes, across all datasets, and selecting the genes having average mutual information greater than 0.5. These criteria led to 59 genes in the consensus mitotic CIN attractor (the top 59 genes in Table 2), and AGR3, ESR1, CA12, AGR2, GATA3, FOXA1, MLPH and TBC1D9 (the top eight genes in Table 6) in the consensus estrogen receptor breast cancer attractor. These scatter plots reveal that ER-negative breast tumors have high mitotic chromosomal instability, but not necessarily vice versa.

Scatter plots demonstrating the relationship between mitotic CIN attractor and estrogen receptor attractor in breast cancer.

The two metagenes were defined to be “consensus attractors” after ranking individual genes in terms of their average mutual information with the corresponding attractor metagenes, across all datasets, and selecting the genes having average mutual information greater than 0.5. These criteria led to 59 genes in the consensus mitotic CIN attractor (the top 59 genes in Table 2), and AGR3, ESR1, CA12, AGR2, GATA3, FOXA1, MLPH and TBC1D9 (the top eight genes in Table 6) in the consensus estrogen receptor breast cancer attractor. These scatter plots reveal that ER-negative breast tumors have high mitotic chromosomal instability, but not necessarily vice versa.

Discussion

Gene expression analysis has resulted in several cancer types being further classified into subtypes labeled, e.g. as “mesenchymal” or “proliferative.” Such characterizations, however, may sometimes simply reflect the presence of the mesenchymal transition attractor or the mitotic chromosomal instability attractor, respectively, in some of the analyzed samples. Similar subtype characterizations across cancer types often share several common genes, but the consistency of these similarities has not been significantly high. By contrast, using an unconstrained algorithm independent of subtype classification or dimensionality reduction, we identified several attractors exhibiting remarkable consistency across many cancer types, suggesting that each of them represents a precise biological phenomenon present in multiple cancers. We found that the mesenchymal transition attractor is significantly present only in samples whose stage designation has exceeded a threshold, but not in all of such samples. On the other hand, the absence of the mesenchymal transition attractor in a profiled high-stage sample (or the absence of the mitotic chromosomal instability attractor in a profiled high-grade sample) does not necessarily mean that the attractor is not present in other locations of the same tumor. Indeed, it is increasingly appreciated [46] that tumors are highly heterogeneous. Therefore it is possible for the same tumor to contain components, in which, e.g., some are migratory having undergone mesenchymal transition, some other ones are highly proliferative, etc. If so, attempts for subtype classification based on one particular site in a sample may be confusing. Existing molecular marker products make use of multigene assays that have been derived from phenotypic associations in particular cancer types. For breast cancer, biomarkers such as Oncotype DX [47] and Mammaprint [48] contain several genes highly ranked in our attractors. For example, most of the genes used for the Oncotype DX breast cancer recurrence score directly converge to one of our identified attractors: MMP11 to the mesenchymal transition attractor; MKI67 (aka Ki-67), AURKA (aka STK15), BIRC5 (aka Survivin), CCNB1, and MYBL2 to the mitotic CIN attractor; CD68 to the lymphocyte-specific attractor; ERBB2 and GRB7 to the HER2 amplicon attractor; and ESR1, SCUBE2, PGR to the estrogen receptor attractor. We envision, instead, a multi-cancer biomarker product that will include detection of the level of expression of each of the key attractor metagenes. These levels would need to be combined in different ways in different cancer types, but each of the metagenes would indicate the same attribute and the contributions of each component will be cleanly distinguished. Even though molecular marker genes in some existing products are already separated into groups that are related to our attractor designation, any improvement in diagnostic, prognostic, or predictive accuracy resulting from better such group designation and better choice of genes in each group would be highly desirable. We hope that the identification of the attractors of cancer, as presented here, will be valuable in that regard.

Materials and Methods

The full code of the attractor finding algorithm is publicly available in the Sage Bionetworks Synapse platform at https://synapse.sagebase.org/#Synapse:syn1446295. In addition, we provide a pseudo-code in Supplementary Text S3.

General Attractor Finding Algorithm

We chose the association measure between genes to be a power function with exponent a of a normalized estimated information theoretic measure of the mutual information [49] with minimum value 0 and maximum value 1 (see “Mutual information estimation” below; more sophisticated related association measures [50] can also be used, but computational complexity will be prohibitive). In other words, , in which the exponent can be any nonnegative number. The value of is set to zero if the Pearson correlation between the two genes is negative. Each iteration defines a new metagene in which the weight for gene is equal to where is the immediately preceding metagene. The process is repeated until the magnitude of the difference between two consecutive weight vectors is less than a threshold, which we chose to be equal to 10−7. At one extreme, if is sufficiently large then each of the seeds will create its own single-gene attractor because all other genes will always have near-zero weights. In that case, the total number of attractors will be equal to the number of genes. At the other extreme, if is zero then all weights will remain equal to each other representing the average of all genes, so there will only be one attractor. The higher the value of , the “sharper” (more focused on its top gene) each attractor will be and the higher the total number of attractors will be. As the value of is gradually decreased, the attractor from a particular seed will transform itself, occasionally in a discontinuous manner, thus providing insight into potential related biological mechanisms. We empirically found that an appropriate choice of (in the sense of maximizing the strength of the attractor, as defined below) for general attractors is around 5, in which case there will typically be approximately 50 to 150 resulting attractors, each resulting from many attractee genes. An alternative to the power function can be a sigmoid function with varying steepness, but we found that the consistency of the resulting attractors was worse in that case. As mentioned in the Introduction, an attractor metagene can also be interpreted as a set of the top genes of the attractor, i.e., a gene set that includes only the genes that are significantly associated with the attractor. One empirical choice for such a gene set would be to include only the genes whose mutual information (or the z-score thereof) with the attractor metagene exceeds a particular threshold. In fact, the attractor finding algorithm itself can be designed to discover “attractor gene sets,” without assigning weights to genes. In that case, metagenes are defined as simple averages of the genes in a gene set, and each iteration leads to a new gene set consisting of the new set of top-ranked genes in terms of their association with the previous metagene (gene set sizes can be constant or adaptively changing in various ways). We found, however, that such a method has the disadvantage of occasionally leading to attractors with significant overlap, which requires additional post-processing steps. Identified attractors can be ranked in various ways. The “strength of an attractor” can be defined as the mutual information between the n th top gene of the attractor and the attractor metagene. Indeed, if this measure is high, this implies that at least the top n genes of the attractor are strongly co-expressed. We selected n = 10 as a reasonable choice, not too large, but sufficiently so to represent a real complex biological phenomenon of co-expression of at least ten genes. For amplicons, n = 5 is sufficient to ensure that the oncogenes are included in the co-expression). We use these choices when referring to the strength of an attractor. The top genes of many among the found attractors are genomically localized. In that case the biomolecular event that they represent is often the presence of a particular copy number variation. In the cancer datasets that we tried, this phenomenon almost always corresponds to a local amplification event known as an amplicon. We therefore also devised a related amplicon-finding algorithm, custom-designed to identify localized amplicon-representing attractor metagenes, described below.

Genomically Localized Attractor Finding Algorithm

To identify genomically localized attractors – almost always amplicons – we use the same algorithm but for each seed gene we restrict the set of candidate attractor genes to only include those in the local genomic neighbourhood of the gene, and we optimize the exponent a so that the strength of the attractor is maximized. Specifically, we sort the genes in each chromosome in terms of their genomic location and we only consider the genes within a window of size 51, i.e., with 25 genes on each side of the seed gene. We further optimize the choice of the exponent for each seed, by allowing to range from 1.0 to 6.0 with step size of 0.5 and selecting the attractor with the highest strength. Because the set of allowed genes is different for each seed, the attractors will be different from each other, but “neighbouring” attractors will usually be very similar to each other. Therefore, following exhaustive attractor finding from each seed gene in a chromosome, we apply a filtering algorithm to only select the highest-strength attractor in each local genomic region, as follows: For each attractor, we rank all the genes in terms of their mutual information with the corresponding attractor metagene and we define the range of the attractor to be the chromosomal range of its top 15 genes. If there is any other attractor with overlapping range and higher strength, then the former attractor is filtered out. This filtering is done in parallel, so elimination of attractors occurs simultaneously.

Mutual Information Estimation

Assuming that the continuous expression levels of two genes and are governed by a joint probability density with corresponding marginal and , the mutual information is defined as the expected value of . It is a non-negative quantity representing the information that each one of the variables provides about the other. The pairwise mutual information has successfully been used as a general measure of the correlation between two random variables. We compute mutual information with a spline-based estimator [51] using six bins in each dimension. This method divides the observation space into equally spaced bins and blurs the boundaries between the bins with spline basis functions using third-order B-splines. We further normalize the estimated mutual information by dividing by the maximum of the estimated and , so the maximum possible value of is 1.

Pre-processing Gene Expression Datasets

We used Level 3 data when directly available, and imputed missing values using a k-nearest-neighbour algorithm with k = 10, as implemented in R [52]. We normalized the other datasets on the Affymetrix platform using the RMA algorithm as implemented in the affy package in Bioconductor [53]. To avoid biasing attractor convergence with multiple correlated probe sets of the same gene, we summarized the probe set-level expression values into the gene-level expression values by taking the mean of the expression values of probe sets for the same genes. We used the annotations for the probe sets given in the jetset package [54]. To investigate the associations between the attractor metagene expression and the tumor stage and grade, we used the following annotated gene expression datasets. For stage association: Breast (GSE3893), TCGA Ovarian, Colon (GSE14333). For grade association: Breast (GSE3494), TCGA Ovarian, Bladder (GSE13507). For Breast GSE3494 we used only the samples profiled by U133A arrays. For Breast GSE3893 we combined two platforms by taking the intersections of the probes in the U133A and the U133Plus 2.0 arrays. For datasets profiled by Affymetrix platforms all the datasets were normalized using the RMA algorithm. For Bladder GSE13507 normalization was done as provided in the GEO.

P Value Evaluation

P values for gene set enrichment were evaluated with the cumulative hypergeometric distribution using the total number of genes in each dataset. The significance of the consistency of the mesenchymal transition and mitotic CIN attractors was evaluated as follows: Supplementary Table S1 contains 210 gene sets from six cancer datasets. Each of the gene sets contains 50 genes. The mesenchymal transition metagene has eight genes (COL5A2, COL1A2, SPARC, CTSK, FBN1, VCAN, AEBP1, SERPINF1) common across all six datasets. The mitotic CIN metagene has 13 common genes (CENPA, DLGAP5, KIF2C, CCNB2, MELK, CCNA2, KIF20A, HJURP, NUSAP1, BUB1, TTK, KIF11, NCAPH) across all six datasets. To evaluate the significance of the consistency across the six datasets, we randomly generated 210 gene sets with the same sizes as those in the Table. In other words, we randomly selected 50 genes out of the 11,395 common genes to generate a random gene set. We created 210 such random gene sets, and then assigned them to six different datasets according to the settings in the Table. We then performed the clustering algorithm described in Materials and Methods. Each time, we counted the maximum number of genes common in all six datasets, and we repeated this process ten million times. This constitutes a conservative way of evaluating consistency, in the sense that for each random experiment we only record the maximum number of common genes in the gene set cluster, and we created random gene sets using only the common genes in the six datasets. In these ten million experiments, it never occurred that more than one gene was common in all six datasets. Therefore, the corresponding P value for both the mesenchymal transition metagene as well as the mitotic CIN metagene is less than 10−7, and is in fact much lower than that given the large number (8 and 13 respectively) of the common genes. General attractors identified from the six datasets. (XLS) Click here for additional data file. Genomically localized attractors identified from the six datasets. (XLS) Click here for additional data file. Association of mesenchymal transition attractor with tumor stage. (XLS) Click here for additional data file. Association of mitotic CIN attractor with tumor grade. (XLS) Click here for additional data file. Common clusters from the six datasets using k-means. (XLS) Click here for additional data file. Common clusters from the six datasets using principal component analysis. (XLS) Click here for additional data file. Common clusters from the six datasets using hierarchical clustering. (XLS) Click here for additional data file. Datasets and methods used to derive Supplementary Tables S1, S2, S3, S4. (DOCX) Click here for additional data file. Comparison with other unsupervised algorithms. (DOCX) Click here for additional data file. Pseudo-code for attractor metagene finding algorithm. (DOCX) Click here for additional data file.
  52 in total

Review 1.  From signatures to models: understanding cancer using microarrays.

Authors:  Eran Segal; Nir Friedman; Naftali Kaminski; Aviv Regev; Daphne Koller
Journal:  Nat Genet       Date:  2005-06       Impact factor: 38.330

Review 2.  An overview of epithelio-mesenchymal transformation.

Authors:  E D Hay
Journal:  Acta Anat (Basel)       Date:  1995

3.  A signature of chromosomal instability inferred from gene expression profiles predicts clinical outcome in multiple human cancers.

Authors:  Scott L Carter; Aron C Eklund; Isaac S Kohane; Lyndsay N Harris; Zoltan Szallasi
Journal:  Nat Genet       Date:  2006-08-20       Impact factor: 38.330

4.  repp86: A human protein associated in the progression of mitosis.

Authors:  Hans-Juergen Heidebrecht; Sabine Adam-Klages; Monika Szczepanowski; Marc Pollmann; Friedrich Buck; Elmar Endl; Marie-Luise Kruse; Pierre Rudolph; Reza Parwaresch
Journal:  Mol Cancer Res       Date:  2003-02       Impact factor: 5.852

5.  SLP-76-ADAP adaptor module regulates LFA-1 mediated costimulation and T cell motility.

Authors:  Hongyan Wang; Bin Wei; Georges Bismuth; Christopher E Rudd
Journal:  Proc Natl Acad Sci U S A       Date:  2009-07-15       Impact factor: 11.205

6.  A novel function for HSF1-induced mitotic exit failure and genomic instability through direct interaction between HSF1 and Cdc20.

Authors:  Y J Lee; H J Lee; J S Lee; D Jeoung; C M Kang; S Bae; S J Lee; S H Kwon; D Kang; Y S Lee
Journal:  Oncogene       Date:  2007-12-03       Impact factor: 9.867

7.  Jetset: selecting the optimal microarray probe set to represent a gene.

Authors:  Qiyuan Li; Nicolai J Birkbak; Balazs Gyorffy; Zoltan Szallasi; Aron C Eklund
Journal:  BMC Bioinformatics       Date:  2011-12-15       Impact factor: 3.169

8.  Subtypes of pancreatic ductal adenocarcinoma and their differing responses to therapy.

Authors:  Eric A Collisson; Anguraj Sadanandam; Peter Olson; William J Gibb; Morgan Truitt; Shenda Gu; Janine Cooc; Jennifer Weinkle; Grace E Kim; Lakshmi Jakkula; Heidi S Feiler; Andrew H Ko; Adam B Olshen; Kathleen L Danenberg; Margaret A Tempero; Paul T Spellman; Douglas Hanahan; Joe W Gray
Journal:  Nat Med       Date:  2011-04-03       Impact factor: 53.440

9.  CENPA overexpression promotes genome instability in pRb-depleted human cells.

Authors:  Angela Amato; Tiziana Schillaci; Laura Lentini; Aldo Di Leonardo
Journal:  Mol Cancer       Date:  2009-12-10       Impact factor: 27.401

10.  Heat shock factor 1 is a powerful multifaceted modifier of carcinogenesis.

Authors:  Chengkai Dai; Luke Whitesell; Arlin B Rogers; Susan Lindquist
Journal:  Cell       Date:  2007-09-21       Impact factor: 41.582

View more
  51 in total

1.  Immunological metagene signatures derived from immunogenic cancer cell death associate with improved survival of patients with lung, breast or ovarian malignancies: A large-scale meta-analysis.

Authors:  Abhishek D Garg; Dirk De Ruysscher; Patrizia Agostinis
Journal:  Oncoimmunology       Date:  2015-08-12       Impact factor: 8.110

2.  JAK2 expression is associated with tumor-infiltrating lymphocytes and improved breast cancer outcomes: implications for evaluating JAK2 inhibitors.

Authors:  Chris P Miller; Jason D Thorpe; Amanda N Kortum; Catherine M Coy; Wei-Yi Cheng; Tai-Hsien Ou Yang; Dimitris Anastassiou; J David Beatty; Nicole D Urban; C Anthony Blau
Journal:  Cancer Immunol Res       Date:  2014-01-15       Impact factor: 11.151

Review 3.  Principles and methods of integrative genomic analyses in cancer.

Authors:  Vessela N Kristensen; Ole Christian Lingjærde; Hege G Russnes; Hans Kristian M Vollan; Arnoldo Frigessi; Anne-Lise Børresen-Dale
Journal:  Nat Rev Cancer       Date:  2014-05       Impact factor: 60.716

4.  Data-Driven Tree Transforms and Metrics.

Authors:  Gal Mishne; Ronen Talmon; Israel Cohen; Ronald R Coifman; Yuval Kluger
Journal:  IEEE Trans Signal Inf Process Netw       Date:  2017-08-23

5.  PRMT2 and RORγ expression are associated with breast cancer survival outcomes.

Authors:  Tae Gyu Oh; Peter Bailey; Eloise Dray; Aaron G Smith; Joel Goode; Natalie Eriksson; John W Funder; Peter J Fuller; Evan R Simpson; Wayne D Tilley; Peter J Leedman; Christine L Clarke; Sean Grimmond; Dennis H Dowhan; George E O Muscat
Journal:  Mol Endocrinol       Date:  2014-06-09

6.  FBXO31 protects against genomic instability by capping FOXM1 levels at the G2/M transition.

Authors:  J M Jeffery; M Kalimutho; P Johansson; D G Cardenas; R Kumar; K K Khanna
Journal:  Oncogene       Date:  2016-08-29       Impact factor: 9.867

Review 7.  Beyond cytokinesis: the emerging roles of CEP55 in tumorigenesis.

Authors:  J Jeffery; D Sinha; S Srihari; M Kalimutho; K K Khanna
Journal:  Oncogene       Date:  2015-04-27       Impact factor: 9.867

Review 8.  Machine Learning in Medicine.

Authors:  Rahul C Deo
Journal:  Circulation       Date:  2015-11-17       Impact factor: 29.690

9.  Metabolite profiling stratifies pancreatic ductal adenocarcinomas into subtypes with distinct sensitivities to metabolic inhibitors.

Authors:  Anneleen Daemen; David Peterson; Nisebita Sahu; Ron McCord; Xiangnan Du; Bonnie Liu; Katarzyna Kowanetz; Rebecca Hong; John Moffat; Min Gao; Aaron Boudreau; Rana Mroue; Laura Corson; Thomas O'Brien; Jing Qing; Deepak Sampath; Mark Merchant; Robert Yauch; Gerard Manning; Jeffrey Settleman; Georgia Hatzivassiliou; Marie Evangelista
Journal:  Proc Natl Acad Sci U S A       Date:  2015-07-27       Impact factor: 11.205

10.  Covariance-based analyses of biological pathways.

Authors:  P Danaher; D Paul; P Wang
Journal:  Biometrika       Date:  2015-09-01       Impact factor: 2.445

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.