
The network organization of cancer-associated protein complexes in human tissues.

Jing Zhao, Sang Hoon Lee, Mikael Huss, Petter Holme.

Abstract

Differential gene expression profiles for detecting disease genes have been studied intensively in systems biology. However, it is known that many biological functions of proteins follow from their ability to form complexes by physically binding to one another. In other words, the functional units are often protein complexes rather than individual proteins. Thus, we seek to replace the perspective of disease-related genes with that of disease-related complexes, exemplifying with data on 39 human solid tissue cancers and their originated normal tissues. To obtain the differential abundance levels of protein complexes, we apply an optimization algorithm to genome-wide differential expression data. From the differential abundance of complexes, we extract tissue- and cancer-selective complexes and investigate their relevance to cancer. The method is supported by a clustering tendency in the bipartite cancer-complex relationships, and it offers a more concrete and realistic approach to disease-related proteomics.

Substances: Multiprotein Complexes

Year: 2013 PMID: 23567845 PMCID: PMC3620901 DOI: 10.1038/srep01583

Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379

Genome sequencing can, at least in an idealized world, list the repertoire of what a cell could possibly do; expression profiling, on the other hand, reflects what the cell is actually doing. Selective or differential gene expression profiles in specific cells therefore add valuable contextual information. It is quite natural to connect differential gene expression profiles to disease states, whether they are genetic diseases or not. A very large number of studies in this vein have been published (e.g., Refs. 1,2,3,4,5,6, to name just a few). Essentially all of these approaches assume that genes are the units of biological functionality. Even if this assumption cannot be dismissed, it has recently been pointed out that the relationships among proteins, not just the properties of individual proteins, are essential ingredients in characterizing biological functions. These relationships can be binary protein-protein interactions (PPIs) [7,8,9,10] or the formation of stable structural and functional units called protein complexes [11,12,13,14,15]. Proteins tend to function as members of complexes, and dysfunctions of different proteins in the same complex generally lead to similar disorders. Research has been conducted to identify disease-associated protein-protein interactions, signaling pathways and protein complexes through the integrated computational analysis of heterogeneous data sources [16,17,18,19,20,21,22]. Human diseases usually occur in one or more specific tissues and organs, while different types of organs and tissues make use of selective sets of expressed genes, protein-protein interactions and protein complexes [23]. Genes predominantly expressed in one or a few biologically similar tissue types are defined as tissue-selective genes [24]. Similarly, protein complexes showing significantly higher abundance levels in one or a limited number of tissues are considered tissue-selective complexes.
Tissue-selective genes and complexes could be disease markers and potential drug targets. Although many approaches have been developed to identify tissue-selective genes and their relationships to diseases [24,25,26,27,28,29], the identification of tissue- and disease-selective complexes is still in its infancy because experimental proteomic data lack adequate coverage, so gene expression levels have been used instead of protein abundance [20,30,31]. In this paper, using the optimization algorithm for estimating differential abundance levels of protein complexes introduced in Ref. 15, we attempt to define the human tissue- and cancer-selective protein complexes. More specifically, we use the recently released E-MTAB-62 gene expression profile dataset [32] and focus on 39 solid tissue cancers and 25 different normal tissues, from some of which the cancers originate (Table 1). From the abundance profiles of complexes, we classify the complexes associated with cancers and tissues into four categories called Patterns 1–4. The complexes over-expressed in cancers but under-expressed in the originated normal tissues are considered most relevant and are analyzed in terms of the bipartite relation between cancers and complexes. Finally, we show that the correlation structures of different cancers and tissues are preserved in our complex-based study, in comparison to the results from individual gene expression levels.

Table 1

List of solid cancers and their originated normal tissues. Cancers were selected from the file “E-MTAB-62.sdrf.txt” as the entries whose columns “Characteristics [4 meta-groups]” and “Characteristics [Blood/NonBlood meta-groups]” are “neoplasm” and “non blood”, respectively. The cancer name and its originated normal tissue are taken from “Characteristics [DiseaseState]” and “Characteristics [OrganismPart]” of the file, respectively.

Cancers	Originated normal tissue
Liposarcoma, Myxoid liposarcoma	adipose tissue
Bladder cancer	bladder
Chondroblastoma, Chordoma, Ewings sarcoma, Osteosarcoma, Spindle cell tumor	bone
Brain tumor, Ganglioneuroblastoma, Ganglioneuroma, Glioblastoma, Malignant peripheral nerve sheath tumor, Neuroblastoma, Neurofibroma, Schwannoma	brain
Chondromyxoid fibroma, Chondrosarcoma, Dedifferentiated chondrosarcoma, Fibromatosis, Monophasic synovial sarcoma, Sarcoma	connective tissue
Esophageal adenocarcinoma	esophagus
Oral squamous cell carcinoma	hypopharynx
Kidney carcinoma, Renal cell carcinoma	kidney
Hepatocellular carcinoma	liver
Lung cancer	lung
Uterine tumor	myometrium
Head and neck squamous cell carcinoma	hypopharynx
Ovarian tumor	ovary
Prostate cancer	prostate
Acute quadriplegic myopathy	skeletal muscle
Kaposi sarcoma	skin
Alveolar rhabdomyo sarcoma, Embryonal rhabdomyo sarcoma, Leiomyosarcoma	smooth muscle
Germ cell tumor	testis
Thyroid adenocarcinoma	thyroid

Results

Differentially expressed protein complexes in normal tissues

First, we present the differentially expressed protein complexes in normal tissues. For each of the 25 solid tissues under study, using the average abundance levels over all the other tissues as the control set, we extracted over-expressed (under-expressed) complexes whose abundance changed by more than a factor of two (by less than a factor of 1/2) (Tables S1 and S2). A total of 106 and 209 distinct protein complexes were found to be over- and under-expressed in normal tissues, respectively. See Table S3 for the number of complexes differentially expressed in each tissue. The distributions of the number of different tissues in which complexes are over- or under-expressed are shown in Fig. 1. Most complexes are over- or under-expressed in only a small number of tissues, suggesting that a large fraction of the complexes predicted by our method exhibit a high degree of tissue selectivity. Note that the tissues are (of course) not completely independent of one another, which may explain why some complexes are differentially expressed in multiple tissues.

Figure 1

Distributions of number of overlapped tissues for over-expressed (a) and under-expressed (b) complexes, in normal tissues.

For each over- or under-expressed complex in normal tissues, we count the number of tissues where it is over- or under-expressed and define the number as the number of overlapped tissues.

In the CORUM (Comprehensive Resource of Mammalian protein complexes [33]) database, which we use for our complex list, the functions of protein complexes are annotated by the Functional Catalogue (FunCat) scheme, whose hierarchical structure allows browsing for protein complexes with particular cellular functions or localizations [33,34]. However, among all 2837 mammalian protein complexes in the CORUM database, only 148 have information about the specific animal tissue of the complex. Because of this lack of tissue-specific annotation, only 5 of the 106 over-expressed complexes predicted by our method have a tissue annotation. As shown in Table 2, 4 of these 5 complexes are consistent with the annotation, supporting the validity of our result. For instance, “thymus” (our predicted tissue) and “bone marrow” (CORUM) are compatible, as both are hot spots of T cell production and maturation [35]. They are both considered (the only) “primary lymphoid organs” [35].

Table 2

Comparison of our results with tissue information of complexes in CORUM. Boldface marks consistent results

complex name	tissue information in CORUM	over-expressed predicted tissue
KCNQ1 macromolecular complex	muscle and heart muscle	adipose tissue
		bone
		brain
		heart
		liver
		smooth muscle
		testis
RICH1-PAR3-aPKC polarity complex	epithelium	adipose tissue
		hypopharynx
		lymph node
		skeletal muscle
		skin
SMAD3-SMAD4-FOXO3-FOXG1 complex	epithelium	connective tissue
		eye
		skin
		thymus
		thyroid
		ovary
PKC-alpha-PLD1-PLC-gamma-2 signaling complex, lacritin stimulated	epithelium	tonsil
YY1-Notch1 complex	bone marrow	thymus

Differentially expressed protein complexes in solid cancers

As in the normal tissue case, for each of the 39 solid tissue cancers, using the abundance levels in the originated normal tissue as the control set, we extract over-expressed (under-expressed) complexes with more than 2-fold (less than 1/2-fold) changes (Tables S4 and S5). A total of 283 and 294 distinct complexes were identified as over- and under-expressed in the cancers, respectively. We call these complexes cancer-associated complexes. Again, from the distributions of the number of different cancers in which complexes are over- or under-expressed, shown in Fig. 2, we can observe the high degree of cancer selectivity of the complexes. The fact that several cancers are derived from the same normal tissues seems to be responsible for the larger number of overlapped cancers compared to the number of overlapped normal tissues in Fig. 1; such cancer-cancer correlations are presented later.

Figure 2

Distributions of number of overlapped cancers for over-expressed (a) and under-expressed (b) complexes, in cancers.

For each over- or under-expressed complex in cancers, we count the number of cancers where it is over- or under-expressed and define the number as the number of overlapped cancers.

The most fundamental assumption of our approach is that complexes, rather than their individual component proteins, are the functional units. In other words, differential abundance profiles for complexes are more relevant than those for individual genes, since each gene may play different functional roles in different complexes, so that its expression levels over different contexts are effectively “averaged out.” In Table 3, we compare the protein complexes over-expressed in brain tumor with their up-regulated component genes that are reported in GeneCards [36] to be associated with nerve system cancers. We use the t-test to test whether a gene is differentially expressed between the brain tumor and control samples. Because such a large number of genes are tested simultaneously, FDR-corrected [37] p-values are used for screening differentially expressed genes. We consider genes with at least a 2-fold change (in log ratio) of average expression level and an FDR of at most 0.05 as up-regulated in brain tumor. In the complexes identified as over-expressed in brain tumor by our algorithm, only a small fraction of the component genes associated with nerve system cancers was up-regulated. Such a large difference strongly supports the fundamental assumption that complexes are more relevant to biological functions and dysfunctions than individual genes.
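The gene-level screening step can be illustrated with a short sketch. It assumes that the FDR correction is the Benjamini-Hochberg procedure (the text cites Ref. 37 without naming the procedure) and that "2-fold change" corresponds to a log2 ratio of at least 1. The function names and the toy p-values are hypothetical, and the p-values are taken as given rather than computed from a t-test.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def up_regulated(genes, log2_fold_changes, p_values,
                 min_log2_fc=1.0, max_fdr=0.05):
    """Genes passing both the fold-change and the FDR cutoffs (assumed cutoffs)."""
    fdr = benjamini_hochberg(p_values)
    return [g for g, fc, q in zip(genes, log2_fold_changes, fdr)
            if fc >= min_log2_fc and q <= max_fdr]

# Invented toy values, for illustration only.
genes = ["g1", "g2", "g3", "g4"]
log2_fc = [1.5, 2.0, 0.5, 1.2]
p_vals = [0.01, 0.04, 0.03, 0.005]
fdr_vals = benjamini_hochberg(p_vals)
selected = up_regulated(genes, log2_fc, p_vals)
print(fdr_vals, selected)
```

The fraction of a complex's component genes that appear in `selected` is the kind of quantity reported in the percentage columns of Table 3.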

Table 3

Comparison of protein complexes over-expressed in brain tumor and their up-regulated component genes

Complex name	Also identified by GSEA	Number of genes in complex	Percentage of up-regulated genes in complex	Number of genes associated with nerve system cancers	Percentage of up-regulated genes in subset of genes associated with nerve system cancers
SMN complex, U7 snRNA specific	Y	5	100%	0	-
CDC2-CCNA2 complex	Y	2	50%	2	50%
VEGF transcriptional complex	Y	5	40%	1	100%
Cell cycle kinase complex CDK5	Y	5	40%	5	40%
Anti-HDAC2 complex	Y	17	17.65%	7	42.86%
Emerin complex 52	Y	23	17.39%	8	25%
RC complex during S-phase of cell cycle	Y	13	7.69%	7	0
WINAC complex	Y	14	7.14%	6	16.67%
EIF3 complex (EIF3B, EIF3G, EIF3I)	Y	3	0	0	-
CAV1-VDAC1-ESR1 complex	N	3	33.33%	3	33.33%
SMURF2-SMAD3-SnoN complex, TGF(beta)-dependent	N	3	33.33%	2	0
VHL-TBP1-HIF1A complex	N	3	33.33%	2	50%
RAF1-RAS complex, EGF induced	N	4	25%	4	25%
P2X7 receptor signalling complex	N	11	9.09%	5	20%
RNA polymerase II complex, chromatin structure modifying	N	18	0	5	0
MRN-TRRAP complex (MRE11A-RAD50-NBN-TRRAP complex)	N	4	0	2	0
PLC-gamma-1-SLP-76-SOS1-LAT complex	N	4	0	2	0
PlexinA1-NRP1-SEMA3A complex	N	3	0	2	0
SMARCA2/BRM-BAF57-MECP2 complex	N	3	0	2	0
TRAP complex	N	15	0	2	0
APP-TIMM23 complex	N	2	0	1	0
BCL6-HDAC7 complex	N	2	0	1	0
DNA polymerase alpha-primase complex	N	4	0	1	0
MCM8-ORC2-CDC6 complex	N	2	0	1	0
RICH1-PAR3-aPKC polarity complex	N	3	0	1	0
APLG1-Rababtin5 complex	N	3	0	0	-
BLM-TOP3A complex	N	2	0	0	-
CTF18-CTF8-DCC1-RFC3 complex	N	2	0	0	-
FEN1-9-1-1 complex	N	4	0	0	-
Kinase-scaffold-phosphatase complex, PKA-AKAP79-CaN	N	3	0	0	-
PPP4C-PPP4R2-Gemin3-Gemin4 complex	N	3	0	0	-
Retrotranslocation complex	N	2	0	0	-
RFC2-RIalpha complex	N	2	0	0	-
TRAP-SMCC mediator complex	N	7	0	0	-

Considering that the database E-MTAB-62 we used is an integration of data generated in different laboratories, we conducted a within-laboratory comparison on over-expressed complexes in brain tumor to see to what extent our result is replicated across studies. The samples of brain tumor and normal brain tissue came from 2 and 6 different laboratories, respectively. By combining brain tumor samples from one lab with normal brain samples from another lab, we got 12 different sample sets. We ran our algorithm on each sample set and identified complexes over-expressed in brain tumor. As shown in Figure 3, most complexes identified by our algorithm are also identified by at least half of the sample sets. Then we ran our algorithm on each of the brain tumor and normal brain tissue samples, respectively. By t-test and multiple testing corrections on the resulting complex abundance matrix of large samples, we identify complexes statistically over-expressed in brain tumor with FDR < 0.05. A total of 29 complexes identified over-expressed by this sample replication method are also identified by our method which used the average of samples as input (See Figure 3). These comparisons suggest the robustness of our algorithm on different data resources.

Figure 3

Cross-validation of complexes over-expressed in brain tumor identified by our method by within-laboratory comparison, sample replication method and GSEA.

We also compare our algorithm to a gene set testing approach, Gene Set Enrichment Analysis (GSEA) [38]. Using the CORUM complexes as gene sets, we conducted GSEA on the expression data of brain tumor and normal brain tissue. This method identifies 227 complexes that were significantly enriched in brain tumor tissue (FDR < 25%). As shown in Figure 3 and Table 3, 9 of the 34 complexes identified by our method as over-expressed in brain tumor are also identified by GSEA. Table 3 shows that the overlapped complexes contain relatively more up-regulated genes, which is the principle by which GSEA identifies enriched gene sets. Complexes identified as over-expressed only by our algorithm include genes reported to be associated with nerve system cancers, suggesting that they may be related to brain tumor. However, these complexes are not detected by GSEA because few of their genes were up-regulated. This comparison suggests that our algorithm, which considers the stoichiometry of complexes from a global point of view, could add new information in complex prediction. Figure 3 shows that several complexes, such as the Anti-HDAC2 complex, SMN complex, EIF3 complex and CDC2-CCNA2 complex, are identified as over-expressed in brain tumor by all four methods, suggesting strong expression signals of these complexes in brain tumor. Complexes such as the CDC2-CCNA2 complex, Anti-HDAC2 complex and WINAC complex are more obviously associated with brain tumor because a high fraction of their component proteins are related to nerve system cancer (see Table 3). However, according to the GeneCards and GoPubMed databases, none of the five component proteins of the SMN complex (small nuclear ribonucleoprotein B, D, E, F, G) is associated with nerve system cancer, although they are highly associated with neurologic manifestations and neurodegenerative diseases. Our computations found that this complex and its five component proteins are significantly over-expressed in brain tumor, indicating a relationship with brain tumor. More research is needed to validate such results.

Expression patterns of cancer-associated complexes in normal tissues

For complexes differentially expressed in a cancer, we compare their abundance levels in the cancer tissue with those in the originated normal tissue and in the other normal tissues. Specifically, we mapped the differentially expressed complexes in each cancer to each normal tissue and classified the differential expression of these complexes according to the following four patterns:

Pattern 1: over-expressed in the cancer tissue but under-expressed in the normal tissue
Pattern 2: over-expressed in the cancer tissue as well as in the normal tissue
Pattern 3: under-expressed in the cancer tissue but over-expressed in the normal tissue
Pattern 4: under-expressed in the cancer tissue as well as in the normal tissue

For each cancer, we count the number of complexes of each Pattern in each tissue (see Table S6). Then, for each cancer, we list the number of complexes of each Pattern in the tissue from which it originated, along with the largest number of complexes among the tissues other than its originated tissue (see Table S7). Figure 4 shows the distribution of the four differential expression patterns of cancer-associated complexes in their originated normal tissues. The dominant expression patterns are Patterns 1 (57.2%) and 3 (27.1%), whereas Pattern 2 and Pattern 4 complexes in originated normal tissues (1.15% and 3.87%) are minorities. In Table S7, we compare the four patterns in the cancers' originated normal tissues with those in the other normal tissue with the maximum number of cancer-associated complexes. Table S7 shows that, compared with those in the other normal tissues, Pattern 1 complexes in originated normal tissues are much more numerous (57.2% vs. 22.6%); Pattern 2 and 4 complexes in originated normal tissues are much fewer (1.15% vs. 17.5% for Pattern 2, and 3.87% vs. 22.94% for Pattern 4); and Pattern 3 complexes show no significant difference (27.1% vs. 26.9%). Moreover, by the t-test, the expression of Pattern 3 complexes in originated normal tissues is not significantly different from that in the other normal tissues, whereas the expression of Pattern 1, 2 and 4 complexes is in each case significantly different from that in the other normal tissues.

Figure 4

Differentially expressed complexes in cancers and originated normal tissues.

Log-ratios of abundance in cancers (vertical axis) are defined with respect to the originated normal tissues, and those in normal tissues (horizontal axis) are defined with respect to all the other normal tissues. The log-ratio values in the “normal” range (–1, 1) are excluded for both cancers and normal tissues. Four different patterns are noted according to their differential abundance levels in cancers and their originated tissues.
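The four patterns and the excluded "normal" range can be expressed as a small classification function. This is a sketch under assumptions: the log-ratios are taken to be base-2, values whose absolute log-ratio does not exceed the threshold are treated as not differentially expressed, and the name `classify_pattern` is hypothetical.

```python
def classify_pattern(log_ratio_cancer, log_ratio_normal, threshold=1.0):
    """Return the Pattern number (1-4), or None if either value is in the normal range.

    log_ratio_cancer: log-ratio of the complex in the cancer vs. its originated tissue.
    log_ratio_normal: log-ratio of the complex in a normal tissue vs. the other tissues.
    """
    if abs(log_ratio_cancer) <= threshold or abs(log_ratio_normal) <= threshold:
        return None  # not differentially expressed on at least one axis
    if log_ratio_cancer > 0:
        # over-expressed in the cancer
        return 1 if log_ratio_normal < 0 else 2
    # under-expressed in the cancer
    return 3 if log_ratio_normal > 0 else 4

# Invented example points, one per quadrant of the Figure 4 layout.
examples = [(1.5, -2.0), (1.5, 2.0), (-1.5, 2.0), (-1.5, -2.0), (0.5, -2.0)]
patterns = [classify_pattern(c, n) for c, n in examples]
print(patterns)
```

Counting the returned labels per cancer and per tissue would give tallies of the kind summarized in Tables S6 and S7.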

From these observations, we can conclude that solid cancers tend to over-express complexes that are under-expressed in the normal tissues of the cancers' origin (Pattern 1). In other words, complexes that are not supposed to be expressed in a specific tissue but are over-expressed in it can be related to cancers. Furthermore, solid cancers could over-express (or under-express) some of the complexes that are over-expressed (or under-expressed) in normal tissues other than the cancer's tissue of origin (Patterns 2 and 4). These patterns could complement earlier findings on single-gene expression patterns in cancers. For example, it was reported that genes over-expressed in human leukemias were rarely over-expressed in hematopoietic tissues [39]. Generally, cancers over-express only a fairly small fraction of the genes that are selectively expressed in their originated tissues [25]. On the other hand, under-expressed complexes in cancers do not have a statistically significant tendency to be over-expressed in the originated normal tissues (Pattern 3), which can be interpreted to mean that the lack of necessary complexes does not tend to cause cancers, in contrast to the presence of unnecessary complexes in Pattern 1. It is known that one form of cancer can affect many tissues, not only the tissue from which it originated. The expression patterns of cancer-associated complexes may indicate such cancer-tissue relations. One interesting way to verify the cancer-tissue relations from an external source is to use a Web search engine [40]. Our basic assumption is that the more Web pages Google finds for the search query ‘[cancer name][tissue name]’, the more likely the tissue is related to the cancer. We measure this cancer-tissue “Google correlation” (‘Google page’ column in Table S6).
For a specific cancer A, the Google correlation value for the ‘[cancer A][originated tissue of cancer A]’ pair is in most cases ranked near the top among all the ‘[cancer A][tissue name]’ pairs. More precisely, 14 of the 39 cancers have their largest Google correlation value with their originated tissues. This result supports our assumption. In addition, from Table S6, for each cancer, we calculated the Pearson correlation coefficient between the column ‘Google pages’ and each of the columns ‘Patterns 1–4’, as shown in Table S8. The statistical significance test suggested that cancer-associated complexes are expressed according to Patterns 1, 2 or 4. Thus, we took the maximum value of the Pearson correlation coefficient over Patterns 1, 2 and 4, and show it in the last column of Table S8. Most (about 3/4) of the Pearson correlation coefficients in the last column are positive, suggesting a positive correlation between the cancer-tissue relations obtained from the Google correlation and those obtained from the number of cancer-associated complexes with differential abundance levels.
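The per-cancer correlation described here can be sketched as below. The page counts and pattern counts are invented for illustration and are not taken from Table S6. The helper names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# One hypothetical cancer: one entry per normal tissue (invented numbers).
google_pages = [1200, 300, 150, 90]
pattern_counts = {
    1: [40, 12, 9, 3],
    2: [1, 5, 6, 4],
    4: [2, 3, 8, 7],
}

# Correlate the page counts with each pattern column, then keep the maximum
# over Patterns 1, 2 and 4, as in the last column of Table S8.
correlations = {p: pearson(google_pages, counts)
                for p, counts in pattern_counts.items()}
best = max(correlations.values())
print(correlations, best)
```

A positive `best` for a cancer corresponds to the case reported for about three quarters of the cancers in the text.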

Bipartite complex-cancer relations and common complexes associated with the same cluster of cancers

The previous subsection suggests that most cancer-associated complexes are Pattern 1 complexes in the originated normal tissues, i.e., over-expressed in the cancer tissue but under-expressed in the originated normal tissue. Thus we focus on these Pattern 1 complexes. We constructed a bipartite network between cancers and Pattern 1 complexes, in which a cancer node is connected to a complex node if and only if this complex is a Pattern 1 complex of this cancer. In the bipartite network, we measured the topological similarity of the vertices according to the Jaccard similarity index
$$J(u, v) = \frac{|N(u) \cap N(v)|}{|N(u) \cup N(v)|},$$
where N(u) is the set of neighbors of node u. Then Ward's clustering, a hierarchical agglomerative clustering method, was used to cluster the nodes in the network [41]. The hierarchical clustering starts with each node being its own cluster, and the distance between nodes u and v is defined as d(u, v) = 1 − J(u, v). At each step, the pair of clusters (u, v) with the smallest distance d(u, v) is merged into a single cluster, and the distances between clusters are updated as a weighted sum of distances according to the Lance-Williams algorithm [42]. The process is repeated until all nodes have been combined into one cluster, which is represented as a dendrogram with a hierarchical structure. In our case, d(u, v) = 2 is used as the threshold for cutting the hierarchical tree to yield the clustering structure. Figure 5 shows that some cancers are clustered because of their common over-expressed complexes, and that some complexes are also clustered together.
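As a minimal sketch of the similarity measure, the following computes the Jaccard index (size of the intersection of two neighbor sets divided by the size of their union) and the distance d(u, v) = 1 − J(u, v) on a toy bipartite network. The edge list and node names are invented, and the Ward clustering step itself is not reproduced; only its input distances are shown.

```python
def jaccard(neighbors_u, neighbors_v):
    """Jaccard index of two neighbor sets; 0.0 if both sets are empty."""
    union = neighbors_u | neighbors_v
    if not union:
        return 0.0
    return len(neighbors_u & neighbors_v) / len(union)

# Invented cancer-complex edges: a cancer is linked to its Pattern 1 complexes.
edges = [
    ("cancer_A", "complex_1"), ("cancer_A", "complex_2"), ("cancer_A", "complex_3"),
    ("cancer_B", "complex_2"), ("cancer_B", "complex_3"),
    ("cancer_C", "complex_4"),
]

# Build neighbor sets for both sides of the bipartite network.
neighbors = {}
for cancer, complex_name in edges:
    neighbors.setdefault(cancer, set()).add(complex_name)
    neighbors.setdefault(complex_name, set()).add(cancer)

def distance(u, v):
    """Distance used as clustering input: d(u, v) = 1 - J(u, v)."""
    return 1.0 - jaccard(neighbors[u], neighbors[v])

print(distance("cancer_A", "cancer_B"))    # cancers sharing two of three complexes
print(distance("complex_2", "complex_3"))  # complexes linked to the same cancers
```

Cancers are compared through the complexes they share and complexes through the cancers they share, so a pairwise distance matrix built this way could be passed to any agglomerative clustering routine.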

Figure 5

Bipartite network of cancers and protein complexes of Pattern 1.

Triangles (circles) represent cancers (complexes), respectively. The numbers (and corresponding colors) on vertices show the clustering structure defined with the Jaccard similarity index (see the text).

We classify the 39 cancers under study into six categories according to the Medical Subject Headings (MeSH [43]) annotation of their originated tissue categories: nerve tissue neoplasm, connective and soft tissue neoplasm, head and neck neoplasm, urogenital tissue neoplasm, digestive system neoplasm, and respiratory tract neoplasm. Biologically, cancers originating from the same tissue should be correlated to some extent. In Table 4, we list the cluster indexes of the cancers in Figure 5 and their originated tissues. Cancers originating from the same tissue category tend to be clustered together. Figure 5 shows that cancers in clusters 4, 5 and 6 tend to link with complexes in clusters 10, 20 and 18, respectively, suggesting the association of these complex groups with nerve tissue cancers (cluster 4) and connective tissue cancers (clusters 5 and 6), respectively. To verify the correlation of the complexes in cluster 10 with nerve tissue cancers (cluster 4), we searched GoPubMed [44] with complex names or gene names of the complex component proteins (in January 2012) and listed the results in Table 5. A total of 13 of the 17 complexes show a rank 1 association with cancers compared with all diseases, implying important functions of these complexes in the occurrence or development of cancers. The associations of most complexes with nerve system diseases and nerve system cancers rank near the top of the “All of Diseases” (more than 20 disease items) and “Neoplasms by Site” (more than 10 cancer tissue items) lists, respectively, demonstrating a high degree of correlation of the complexes in cluster 10 with nerve system cancers. Moreover, proteins in some complexes, such as the cell cycle kinase complex CDK5 and the SMARCA2/BRM-BAF57-MECP2 complex, have been extensively reported to be associated with the eye cancer retinoblastoma, specifically implying functions of these complexes in nerve system cancers.
In addition, 5 complexes in Table 5 (the CDC2-CCNA2 complex, Cell cycle kinase complex CDK5, Anti-HDAC2 complex, Emerin complex 52 and WINAC complex) are also identified as over-expressed in brain tumor by GSEA (see Table 3), which cross-validates the correlation of these complexes with nerve system cancer. Similarly, the associations of the complexes in cluster 20 with connective tissue cancers are shown in Table S9.

Table 4

Cancers classified by categories of their originated tissues and topology of the cancer-complex association network in Fig. 5

cluster index	cancer	originated tissue
1	acute quadriplegic myopathy	connective and soft tissue
	thyroid adenocarcinoma	head and neck
	germ cell tumor	urogenital tissue
	uterine tumor	urogenital tissue
2	alveolar rhabdomyo sarcoma	connective and soft tissue
	embryonal rhabdomyo sarcoma	connective and soft tissue
	leiomyosarcoma	connective and soft tissue
3	hepatocellular carcinoma	digestive system
	esophageal adenocarcinoma	head and neck
	lung cancer	respiratory tract
	bladder cancer	urogenital tissue
	ovarian tumor	urogenital tissue
4	brain tumor	nerve tissue
	ganglioneuroblastoma	nerve tissue
	ganglioneuroma	nerve tissue
	glioblastoma	nerve tissue
	malignant peripheral nerve sheath tumor	nerve tissue
	neuroblastoma	nerve tissue
	neurofibroma	nerve tissue
	schwannoma	nerve tissue
5	chondroblastoma	connective and soft tissue
	chordoma	connective and soft tissue
	ewings sarcoma	connective and soft tissue
	osteosarcoma	connective and soft tissue
	spindle cell tumor	connective and soft tissue
6	chondromyxoid fibroma	connective and soft tissue
	chondrosarcoma	connective and soft tissue
	dedifferentiated chondrosarcoma	connective and soft tissue
	fibromatosis	connective and soft tissue
	monophasic synovial sarcoma	connective and soft tissue
	sarcoma	connective and soft tissue
7	liposarcoma	connective and soft tissue
	myxoid liposarcoma	connective and soft tissue
	head and neck squamous cell carcinoma	head and neck
	oral squamous cell carcinoma	head and neck
8	Kaposi sarcoma	connective and soft tissue
	kidney carcinoma	urogenital tissue
	renal cell carcinoma	urogenital tissue
	prostate cancer	urogenital tissue

Table 5

GoPubMed search results for the associations of complexes in cluster 10 with nerve tissue cancer. (Complexes with higher specificity are shown in the boldface.) Disease hits: number of PubMed papers indicating the association of searched item with diseases; neoplasms hits/rank: number of PubMed papers indicating the association of searched item with cancers and the rank of paper numbers in “All of Diseases” item of GoPubMed results. Association with nerve tissue cancer: number of PubMed papers indicating the association of searched item with nerve system diseases (box in the first row) and nerve system cancers (box in the second row) and the rank of paper numbers in “All of Diseases” and “Neoplasms by Site,” respectively

				association with Nerve tissue cancer
complex name	item searched	disease hits	neoplasms hits/rank	disease	hits/rank
Cell cycle kinase complex CDK5	CCND1	10111	8862/1	Retinoblastoma	983/3
				Nervous system neoplasms	204/9
RICH1-PAR3-aPKC polarity complex	PARD3	49	20/1	Nervous system diseases	12/2
				Nervous system neoplasms	1/4
Emerin complex 52	Emerin	305	25/9	Nervous system diseases	250/2
				Nervous system neoplasms	1/7
BCL6-HDAC7 complex	BCL6	758	684/1	Nervous system diseases	30/13
				Nervous system neoplasms	15/5
Anti-HDAC2 complex	HDACs	1047	641/1	Nervous system diseases	126/5
				Nervous system neoplasms	7/9
RNA polymerase II complex, chromatin structure modifying	RNA polymerase II complex	1529	568/2	Nervous system diseases	193/7
				Nervous system neoplasms	13/7
SMARCA2/BRM-BAF57-MECP2 complex	SMARCA2	112	81/1	Retinoblastoma	13/2
				Eye neoplasm	13/1
CDC2-CCNA2 complex	CDC2	3777	2581/1	Eye diseases	464/4
				Eye neoplasm	444/1
	CCNA2	270	207/1	Eye diseases	25/4
				Eye neoplasm	22/4
CAV1-VDAC1-ESR1 complex	VDAC1	99	45/1	Nervous system diseases	31/2
				Neuroblastoma	5/2
	CAV1	880	382/1	Nervous system diseases	143/3
				Nervous system neoplasms	12/7
TGF-beta receptor II-TGF-beta3 complex	TGFB3	874	259/1	Nervous system diseases	89/11
				Nervous system neoplasms	9/8
Retrotranslocation complex	GEMIN4	26	9/2	Nervous system diseases	15/1
	SYVN1	36	8/5	Nervous system diseases	9/5
TRAP complex	Mediator complex	2518	805/1	Nervous system diseases	397/4
				Nervous system neoplasms	20/6
RAF1-RAS complex, EGF induced	RAF1	271	174/1	Nervous system diseases	44/6
				Nervous system neoplasms	8/5
	Ras	31223	22723	Nervous system diseases	2166/12
				Nervous system neoplasms	881/8
APLG1-Rababtin5 complex	Rab effec protein	141	38/1	Nervous system diseases	19/5
WINAC complex	SMARCA2	112	81/1	Nervous system diseases	11/8
				Retinoblastoma	13/1
	SMARCA4	219	155/1	Eye diseases	20/7
				Retinoblastoma	17/2
	SMARCB1	291	282/1	Nervous system diseases	121/2
				Nervous system neoplasms	116/1
SNARE complex (STX11, VAMP2, SNAP23)	VAMP2	204	29/4	Nervous system diseases	41/3
MCM8-ORC2-CDC6 complex	CDC6	208	104/1	Eye diseases	13/8
				Retinoblastoma	12/2

Cancer-cancer correlations deduced from gene expression and complex abundance profiles

Our results indicate that many of the complexes predicted by our algorithm are important biological modules involved in the occurrence and development of solid cancers, and that these modules reflect, to some extent, correlations between cancers. To verify whether the predicted complexes capture the relationships between different cancers as the original gene expression data do, we hierarchically clustered the gene expression profiles and the complex abundance profiles of all cancers and normal tissues under study. Similarity between groups is defined as the mean Pearson correlation coefficient between the sample profiles (hierarchical clustering trees in Figs. S1 and S2). In both cases, the members of each of the three large tissue categories that include more cancers (soft tissue, nerve tissue and urogenital tissue) cluster together; i.e., both clustering results show the correlations of cancers and normal tissues of similar tissue categories. Similarly, we hierarchically clustered the cancers according to the relative gene expression level and the relative complex abundance of each cancer against its normal tissue of origin, both measured as log-ratio values (Figs. 6 and 7). Figure 7 shows the heat map of the hierarchical clustering of the 39 cancers according to the relative complex abundance of each cancer against its normal tissue of origin. As in the heat map in Fig. 6, the clusters of cancers in Fig. 7 are mostly consistent with their tissue categories. We partitioned the cancers into 4 clusters according to each of the hierarchical trees in Figs. 6 and 7. We then applied the overlap score to quantify the similarity between the two partitions of cancers generated from the gene expression and protein complex profiles (Refs. 45, 46), and obtained an overlap score of 0.72. Next, we generated 200 pairs of random partitions of the cancers, with the same cluster sizes as in the real data. The average overlap score of the random ensemble was 0.24, and the z-score (Ref. 46) for the overlap score of the two real partitions was 8.15, indicating a high and statistically significant overlap between the two partitions of cancers. These results suggest that our predicted complexes extract cancer modules from the expression data without changing the inherent correlations of the data, and therefore reflect the intrinsic relationships among different cancers.
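The clustering step described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: it assumes SciPy's average linkage on the distance 1 − r (where r is the Pearson correlation) as a stand-in for "mean Pearson correlation between groups", and the function names `cluster_by_pearson` and `shuffled_partition` are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_by_pearson(profiles, n_clusters=4):
    """Cut an average-linkage tree built on 1 - Pearson r into n_clusters groups.

    profiles: 2-D array with one row per cancer and one column per gene or
    complex (log-ratio of the cancer against its normal tissue of origin).
    Assumption: average linkage on 1 - r approximates the paper's criterion.
    """
    corr = np.corrcoef(np.asarray(profiles, dtype=float))
    dist = np.clip(1.0 - corr, 0.0, 2.0)   # guard against rounding below zero
    dist = (dist + dist.T) / 2.0           # remove floating-point asymmetry
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


def shuffled_partition(labels, rng):
    """Random partition with the same cluster sizes as `labels`."""
    return rng.permutation(np.asarray(labels))
```

Permuting the labels keeps every cluster size fixed, which matches the constraint placed on the random ensemble in the text.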

Figure 6

Heat map and hierarchical clustering of 39 cancers.

Similarity between cancers is defined as the Pearson correlation coefficient between the log-ratio expression profiles of their genes.

Figure 7

Heat map and hierarchical clustering of 39 cancers.

Similarity between cancers is defined as the Pearson correlation coefficient between the log-ratio abundance profiles of their complexes.

Discussion

Studies of differential gene expression levels have added significant value to genome-wide analyses, which had focused on genome sequencing, because expression is dynamic and condition-dependent. In other words, they indicate how biological functions are phenomenologically realized for given “blueprints” of genome sequences in different environments. Our method can identify cancer-associated complexes. Because it starts from the assumption that protein complexes are the real biological functional units, we believe it brings us one step closer to biological reality. Our optimization procedure is based on linear programming (solvable in polynomial time), which implies that the method is feasible for future, larger studies. The method, as we apply it in this paper, rests on the assumption that expression levels are strongly correlated with protein abundance. Signals from the Affymetrix arrays used in our data sets can differ from the absolute protein abundance. Nevertheless, given the dataset's broad coverage of both cancers and tissues, we believe this study provides a novel approach that other researchers with better datasets, now or in the future, can adopt. Moreover, advantages of protein-complex-based approaches beyond the identification of cancer-specific complexes could be investigated further in the future.

Methods

Gene expression dataset

For gene expression data, we use the recently released E-MTAB-62 dataset in the ArrayExpress repository (Ref. 32). It integrates 206 different experiments and 5372 samples generated in 163 different laboratories, covering 369 different cell and normal tissue types, diseases, and cell lines. The most important aspect of this dataset is that all the data come from the same platform, pass data quality checks and are normalized together, so that expression levels can be compared across different cancers and tissues. CEL files of samples that did not pass quality checks were removed. The remaining 5372 CEL files were normalized by the Robust Multi-array Average (RMA) method, i.e., the raw intensity values were background corrected, log2 transformed and then quantile normalized (Ref. 47). In this work, we studied 39 solid tissue cancers (708 samples) and 25 normal solid tissue types (440 samples); 18 of the normal tissue types are the tissues in which these cancers originated and were therefore used as control sets (see Table 1).
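To illustrate the last of the normalization steps named above, here is a minimal Python sketch of quantile normalization. It is not the RMA implementation used for E-MTAB-62: background correction and probe-set summarization are omitted, ties are broken by sort order rather than averaged, and the function name `quantile_normalize` is hypothetical.

```python
import numpy as np


def quantile_normalize(log_intensities):
    """Force every column (array/sample) to share the same value distribution.

    log_intensities: 2-D array, rows = probes, columns = samples, already
    background corrected and log2 transformed (assumed done upstream).
    """
    x = np.asarray(log_intensities, dtype=float)
    order = np.argsort(x, axis=0)          # per-column sort order
    ranks = np.argsort(order, axis=0)      # rank of each entry in its column
    reference = np.sort(x, axis=0).mean(axis=1)  # mean of each quantile
    return reference[ranks]                # replace each value by its quantile mean
```

After this step every sample has identical sorted values, while the within-sample ranking of probes is unchanged, which is what makes expression levels comparable across samples.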

Protein complex dataset

For the list of human protein complexes, we use the Comprehensive Resource of Mammalian protein complexes (CORUM) database (Ref. 33), whose core data list 1343 complexes and 2315 component proteins in total; expression profiles for 2064 of these 2315 proteins are available in the E-MTAB-62 data. Of the core complexes, the 1338 that have at least one component protein with an expression profile in E-MTAB-62 are used in our analysis.
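The filtering rule above (keep a complex if at least one of its components has an expression profile) can be sketched as follows. The input layout, a dictionary from complex name to member proteins, and the function name `build_composition_matrix` are assumptions made for illustration and do not reflect the CORUM file format.

```python
import numpy as np


def build_composition_matrix(complexes, measured_proteins):
    """Return (proteins, kept_complexes, S) with a binary composition matrix.

    complexes: dict mapping complex name -> iterable of member protein IDs.
    measured_proteins: IDs that have an expression profile.
    S[i, j] = 1 if measured protein i is a member of kept complex j, else 0.
    """
    proteins = sorted(measured_proteins)
    index = {p: i for i, p in enumerate(proteins)}
    # Keep only complexes with at least one measured component.
    kept = [name for name, members in complexes.items()
            if any(m in index for m in members)]
    S = np.zeros((len(proteins), len(kept)), dtype=int)
    for j, name in enumerate(kept):
        for m in complexes[name]:
            if m in index:
                S[index[m], j] = 1
    return proteins, kept, S
```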

Estimation of abundance levels of complexes based on optimization

The detailed background and procedure of our optimization algorithm are described in Ref. 15. Assume that the copy number of protein $i$ ($i = 1, \ldots, N$; $N$ is the number of proteins) and the number of complex $j$ ($j = 1, \ldots, M$; $M$ is the number of complexes) are given by $P_i$ and $c_j$, respectively. Also, denote the number of copies of protein $i$ in complex $j$ by $S_{ij}$, where $S_{ij} = 0$ if complex $j$ does not include protein $i$ as a component. In the ideal situation where all the proteins in a cell are present in exactly the amount needed to form complexes, the variable sets $\{P_i\}$ and $\{c_j\}$ satisfy

$$P_i = \sum_{j=1}^{M} S_{ij} c_j. \tag{1}$$

The question is how to determine $\{c_j\}$ (variables) from known values of $\{P_i\}$ and $\{S_{ij}\}$ (constants). However, since the number of proteins $N$ is usually larger than the number of complexes $M$, the set of linear equations above is over-determined, so in general it is not possible to satisfy all the equations in Eq. (1). In reality, therefore, the number of proteins in a cell should be greater than or equal to that necessary to form complexes, i.e., $P_i \ge \sum_{j} S_{ij} c_j$, which is the basic constraint of our optimization scheme. Instead of finding an exact solution satisfying Eq. (1), we try to minimize the deviation from the ideal situation in Eq. (1), given by the objective function

$$D_A = \sum_{i:\,P_i \neq 0} \left( P_i - \sum_{j=1}^{M} S_{ij} c_j \right), \tag{2}$$

where the summation runs only over indices $i$ with $P_i \neq 0$. Now, for given values of $P_i$ and $S_{ij}$, our basic strategy is to determine the $c_j$ values that minimize $D_A$ in Eq. (2), and this problem is solved numerically by the linear programming (LP) technique. Moreover, once the $c_j$ values are determined, any unknown values of $P_i$ can be assigned using Eq. (1) for the ideal situation. This optimization is based on the assumption that organisms have evolved in a way that increases efficiency by reducing wasted resources. In this work, the average expression level of the gene encoding protein $i$ is used as the $P_i$ value (Ref. 33), and the composition matrix $S$ is approximated by binary values ($S_{ij} = 1$ if protein $i$ is included in complex $j$, 0 otherwise).
Ideally, protein complex levels should be estimated from protein abundance, since the mRNA expression level cannot completely represent the true protein abundance. However, although several large proteomics data sets are available (Refs. 48, 49), there are currently no equally rich genome-wide protein abundance data sets for tumor versus normal tissue samples. Several studies have found mRNA and protein expression levels to be well correlated (Refs. 50, 51); approximately 40% of the variation in mammalian protein abundance is reported to be explained by mRNA levels (Ref. 51). It is known that signals from the Affymetrix arrays used in our data sets can differ from the absolute protein abundance. However, our method does not, strictly speaking, need absolute abundances: it is sufficient that the relative abundances are accurately measured, since the objective function and all constraints in our linear programming (LP) optimization are linear. Therefore, using gene expression levels directly as protein abundance is not free from errors, but it could yield reasonable results.
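The optimization described in this section can be sketched with SciPy's `linprog`. This is an illustration under stated assumptions, not the authors' implementation. The deviation is taken as the plain sum of the unused protein amounts over proteins with known abundance, so minimizing it amounts to maximizing the total protein amount bound in complexes. Proteins with $P_i = 0$ are treated as unknown and excluded from both the objective and the constraints. The function name `estimate_complex_abundance` is hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def estimate_complex_abundance(P, S):
    """Estimate complex abundances c from protein levels P and composition S.

    Solves: minimize sum_i (P_i - sum_j S_ij c_j)
            subject to sum_j S_ij c_j <= P_i and c_j >= 0,
    using only proteins with P_i > 0 (assumption: 0 marks an unknown level).
    """
    P = np.asarray(P, dtype=float)
    S = np.asarray(S, dtype=float)
    known = P > 0
    S_k, P_k = S[known], P[known]
    # The P_i terms are constants, so minimizing the deviation is the same
    # as minimizing -sum_i sum_j S_ij c_j.
    cost = -S_k.sum(axis=0)
    res = linprog(cost, A_ub=S_k, b_ub=P_k, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    c = res.x
    deviation = float(np.sum(P_k - S_k @ c))
    return c, deviation


# Toy example: 3 proteins, 2 complexes; protein 1 is shared by both complexes.
S_toy = [[1, 0], [1, 1], [0, 1]]
c_toy, d_toy = estimate_complex_abundance([2.0, 10.0, 4.0], S_toy)
```

Because the objective and constraints are linear, multiplying all protein levels by a common factor multiplies the estimated complex abundances by the same factor. This is why relative rather than absolute abundances suffice.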

Identification of differentially expressed complexes in cancers and normal tissues

For each cancer or tissue, the expression profiles of individual genes are averaged over the corresponding samples in the E-MTAB-62 dataset, and the resulting set $\{P_i\}$ is used as the input data. Our optimization procedure minimizing Eq. (2) then yields the set $\{c_j\}$, i.e., the complexes' abundance levels for that cancer or tissue. The abundance levels of all complexes in each cancer are then compared with the abundance levels in the corresponding normal tissue in which the cancer originated, while the abundance levels of all complexes in each normal tissue are compared with the average abundance levels in all the other normal tissues. Over-expression (under-expression) of a complex is defined as an at least 2-fold (at most 1/2-fold) change in abundance level.
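The 2-fold rule above can be expressed as a short Python sketch. It assumes that abundance levels are compared on a linear scale and that a complex with zero abundance in either condition is left unclassified; neither choice is specified in the text. The function name `classify_complexes` is hypothetical.

```python
import numpy as np


def classify_complexes(case_abundance, control_abundance, fold=2.0):
    """Label each complex as 'over', 'under', 'unchanged' or 'undefined'.

    case_abundance: complex abundances in a cancer (or one normal tissue).
    control_abundance: abundances in the tissue of origin (or the average
    over all other normal tissues).
    """
    case = np.asarray(case_abundance, dtype=float)
    control = np.asarray(control_abundance, dtype=float)
    valid = (case > 0) & (control > 0)     # assumption: zeros are not compared
    ratio = np.ones_like(case)
    ratio[valid] = case[valid] / control[valid]
    labels = np.full(case.shape, "unchanged", dtype=object)
    labels[valid & (ratio >= fold)] = "over"
    labels[valid & (ratio <= 1.0 / fold)] = "under"
    labels[~valid] = "undefined"
    return labels
```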

Overlap score

We use the overlap score to measure the extent of overlap between the cancer clusters generated from gene expression profiles and those generated from protein complex profiles (Refs. 45, 46). Consider two different categories A and B (for example, two partitions of the cancers obtained by different clustering methods) and assume each cancer is assigned to one cluster in each of the partitions A and B. Let ϕ(i) and ϕ(j) denote the fractions of cancers in cluster i ∈ A and cluster j ∈ B (i = 1,2,…, m; j = 1,2,…, n), respectively. Let ϕ(i,j) denote the joint frequency of i and j, i.e., the fraction of cancers that belong to both cluster i ∈ A and cluster j ∈ B. In a random distribution of clusters, the expectation value of ϕ(i,j) is ϕ(i)ϕ(j). If the clusters of the different partitions overlap, ϕ(i,j) will be larger than ϕ(i)ϕ(j) for the pairs that overlap and smaller than ϕ(i)ϕ(j) for the others. Thus, the overlap of the clusters in partitions A and B can be measured quantitatively by a quantity μ that aggregates the deviations of ϕ(i,j) from ϕ(i)ϕ(j) over all pairs (i, j); its explicit expression is given in Refs. 45, 46. Since the value of μ is affected by finite sizes, it is hard to judge whether a given μ-value indicates a good or bad overlap. Therefore, we normalize the μ-value against those of the perfect overlaps to define the overlap score ν of partitions A and B (Refs. 45, 46). The value of ν is between 0 and 1, and it is 1 for perfect matches.
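A minimal Python sketch of an overlap score of this kind follows. Two ingredients are assumptions of the sketch rather than the definitions of Refs. 45, 46: μ is taken as the sum of the absolute deviations |ϕ(i,j) − ϕ(i)ϕ(j)|, and ν divides μ(A,B) by the mean of the self-overlaps μ(A,A) and μ(B,B). The z-score helper mirrors the random-ensemble comparison described in the Results, with label permutations that keep cluster sizes fixed. All function names are hypothetical.

```python
import numpy as np


def mu(a, b):
    """Sum over cluster pairs of |phi(i,j) - phi(i) * phi(j)| (assumed form)."""
    a = np.asarray(a).ravel()
    b = np.asarray(b).ravel()
    _, ia = np.unique(a, return_inverse=True)
    _, ib = np.unique(b, return_inverse=True)
    joint = np.zeros((ia.max() + 1, ib.max() + 1))
    np.add.at(joint, (ia, ib), 1.0)        # co-occurrence counts
    joint /= a.size                        # joint frequencies phi(i, j)
    expected = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    return float(np.abs(joint - expected).sum())


def overlap_score(a, b):
    """mu(A,B) normalized by the mean self-overlap (assumed normalization)."""
    denom = mu(a, a) + mu(b, b)
    return 0.0 if denom == 0 else 2.0 * mu(a, b) / denom


def overlap_z_score(a, b, n_random=200, seed=0):
    """Compare the observed score with size-preserving random partitions."""
    rng = np.random.default_rng(seed)
    observed = overlap_score(a, b)
    null = np.array([overlap_score(rng.permutation(a), rng.permutation(b))
                     for _ in range(n_random)])
    z = (observed - null.mean()) / null.std(ddof=1)
    return observed, float(null.mean()), float(z)
```

Under these assumptions, two partitions that differ only in cluster labels score 1, and two statistically independent partitions score 0.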

Author Contributions

J.Z., S.H.L., M.H. and P.H. conceived the study; S.H.L., J.Z. and M.H. developed the methods and analyzed the data; J.Z., S.H.L. and P.H. wrote the paper. All authors read and approved the final version of the manuscript.

46 in total

1. Exploration, normalization, and summaries of high density oligonucleotide array probe level data.

Authors: Rafael A Irizarry; Bridget Hobbs; Francois Collin; Yasmin D Beazer-Barclay; Kristen J Antonellis; Uwe Scherf; Terence P Speed
Journal: Biostatistics Date: 2003-04 Impact factor: 5.899

2. A tissue-specific atlas of mouse protein phosphorylation and expression.

Authors: Edward L Huttlin; Mark P Jedrychowski; Joshua E Elias; Tapasree Goswami; Ramin Rad; Sean A Beausoleil; Judit Villén; Wilhelm Haas; Mathew E Sowa; Steven P Gygi
Journal: Cell Date: 2010-12-23 Impact factor: 41.582

3. The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes.

Authors: Andreas Ruepp; Alfred Zollner; Dieter Maier; Kaj Albermann; Jean Hani; Martin Mokrejs; Igor Tetko; Ulrich Güldener; Gertrud Mannhaupt; Martin Münsterkötter; H Werner Mewes
Journal: Nucleic Acids Res Date: 2004-10-14 Impact factor: 16.971

4. Identification of disease-causing genes using microarray data mining and Gene Ontology.

Authors: Azadeh Mohammadi; Mohammad H Saraee; Mansoor Salehi
Journal: BMC Med Genomics Date: 2011-01-26 Impact factor: 3.063

5. Insights into the pathogenesis of axial spondyloarthropathy from network and pathway analysis.

Authors: Jing Zhao; Jie Chen; Ting-Hong Yang; Petter Holme
Journal: BMC Syst Biol Date: 2012-07-16

6. Identifying dysregulated pathways in cancers from pathway interaction networks.

Authors: Ke-Qin Liu; Zhi-Ping Liu; Jin-Kao Hao; Luonan Chen; Xing-Ming Zhao
Journal: BMC Bioinformatics Date: 2012-06-07 Impact factor: 3.169

7. Ranking candidate disease genes from gene expression and protein interaction: a Katz-centrality based approach.

Authors: Jing Zhao; Ting-Hong Yang; Yongxu Huang; Petter Holme
Journal: PLoS One Date: 2011-09-02 Impact factor: 3.240

8. Global organization of protein complexome in the yeast Saccharomyces cerevisiae.

Authors: Sang Hoon Lee; Pan-Jun Kim; Hawoong Jeong
Journal: BMC Syst Biol Date: 2011-08-15

9. Identification of human housekeeping genes and tissue-selective genes by microarray meta-analysis.

Authors: Cheng-Wei Chang; Wei-Chung Cheng; Chaang-Ray Chen; Wun-Yi Shu; Min-Lung Tsai; Ching-Lung Huang; Ian C Hsu
Journal: PLoS One Date: 2011-07-27 Impact factor: 3.240

10. Modularity-based credible prediction of disease genes and detection of disease subtypes on the phenotype-gene heterogeneous network.

Authors: Xin Yao; Han Hao; Yanda Li; Shao Li
Journal: BMC Syst Biol Date: 2011-05-20

11 in total

1. A novel method for identifying disease associated protein complexes based on functional similarity protein complex networks.

Authors: Duc-Hau Le
Journal: Algorithms Mol Biol Date: 2015-04-28 Impact factor: 1.405

2. Biomolecular condensation of NUP98 fusion proteins drives leukemogenic gene expression.

Authors: Stefan Terlecki-Zaniewicz; Theresa Humer; Thomas Eder; Johannes Schmoellerl; Elizabeth Heyes; Gabriele Manhart; Natalie Kuchynka; Katja Parapatics; Fabio G Liberante; André C Müller; Eleni M Tomazou; Florian Grebien
Journal: Nat Struct Mol Biol Date: 2021-01-21 Impact factor: 15.369