Identifying driver genes in cancer by triangulating gene expression, gene location, and survival data.

Sigrid Rouam1, Lance D Miller2, R Krishna Murthy Karuturi3.   

Abstract

Driver genes are directly responsible for oncogenesis and identifying them is essential in order to fully understand the mechanisms of cancer. However, it is difficult to delineate them from the larger pool of genes that are deregulated in cancer (ie, passenger genes). In order to address this problem, we developed an approach called TRIAngulating Gene Expression (TRIAGE through clinico-genomic intersects). Here, we present a refinement of this approach incorporating a new scoring methodology to identify putative driver genes that are deregulated in cancer. TRIAGE triangulates - or integrates - three levels of information: gene expression, gene location, and patient survival. First, TRIAGE identifies regions of deregulated expression (ie, expression footprints) by deriving a newly established measure called the Local Singular Value Decomposition (LSVD) score for each locus. Driver genes are then distinguished from passenger genes using dual survival analyses. Incorporating measurements of gene expression and weighting them according to the LSVD weight of each tumor, these analyses are performed using the genes located in significant expression footprints. Here, we first use simulated data to characterize the newly established LSVD score. We then present the results of our application of this refined version of TRIAGE to gene expression data from five cancer types. This refined version of TRIAGE not only allowed us to identify known prominent driver genes, such as MMP1, IL8, and COL1A2, but it also led us to identify several novel ones. These results illustrate that TRIAGE complements existing tools, allows for the identification of genes that drive cancer and could perhaps elucidate potential future targets of novel anticancer therapeutics.

Keywords:  cancer; data mining; driver genes; gene expression; survival

Year:  2015        PMID: 25949096      PMCID: PMC4354331          DOI: 10.4137/CIN.S18302

Source DB:  PubMed          Journal:  Cancer Inform        ISSN: 1176-9351


Introduction

Cancer is characterized by the accumulation of genomic abnormalities that result in activated oncogenes and inactivated tumor suppressor genes. These deregulated genes are known as “driver genes.” Identifying genes that “drive” oncogenesis is central to improving our understanding of the mechanisms of cancer and to developing new anticancer therapies. Driver genes can be used as biomarkers of cancer susceptibility. For instance, inherited mutations in BRCA1 and BRCA2 are strong indicators of breast and ovarian cancer risk.1 Driver genes can also be used to define common genetic profiles shared by subgroups of patients who may benefit from targeted treatment strategies. For example, ERBB2 (also known as HER2/neu) is amplified and overexpressed in 20% to 25% of breast cancers2 and is the target of the monoclonal antibody trastuzumab (marketed as Herceptin, http://www.herceptin.com/breast/), a drug that is effective only when ERBB2 is amplified and overexpressed. Those who seek to compile a catalog of additional driver genes must attempt to distinguish them from the larger number of “passenger genes,” which have been disrupted as a result of cancer progression but do not confer a growth or survival (dis)advantage. Driver genes may be deregulated through a number of mechanisms, operating at the levels of both DNA and RNA to trigger oncogenesis. The first genomic aberration consistently found to be associated with malignancy in humans was a translocation between BCR (on chromosome 22) and ABL (on chromosome 9); the resulting shortened chromosome 22 is known as the Philadelphia chromosome.3,4 Following this discovery, a drug named Imatinib (commercialized as Gleevec, http://www.gleevec.com/) was developed to specifically inhibit the resulting fusion gene BCR-ABL.
A translocation between TMPRSS2 and the ETS (E26) family of genes (ERG, ETV1 and ETV4) occurs frequently in prostate cancer.3 Copy number alterations, such as genomic amplifications or deletions, are also common in cancer (eg, amplification of ERBB2 is common in breast cancer, as mentioned above). In addition, mutations can cause deregulation of driver genes leading to oncogenesis. For instance, mutations in TP53, which make the cell insensitive to signals of apoptosis,5 are present in many human tumors. Epigenomic modifications, such as histone methylation, acetylation, and chromatin modifications, also contribute to tumor formation and progression. By activating downstream oncogenes (eg, HRAS in gastric cancer) and by silencing tumor suppressors (eg, RB1 in retinoblastoma6), these modifications lead to chromosomal instability and to more frequent and aggressive tumors. We have recently developed a data-mining strategy called TRIAngulating Gene Expression (TRIAGE through clinico-genomic intersects) to guide the identification of potential driver genes, which are typically deregulated in only a subset of tumor samples. TRIAGE triangulates three levels of information: gene expression, gene location and clinical survival. We have used TRIAGE to discover and validate a novel oncogene, RAB11FIP1, that promotes metastasis in breast cancer.7 TRIAGE has also been used to characterize patients with the fusion gene RPS6KB1-VMP1, a mutation caused by tandem duplications.8 In this work, we describe recent refinements to the TRIAGE scoring methodology and we present the results of simulations testing this new scoring. Further, we describe the results obtained when we applied the newly refined TRIAGE approach to discover new candidate oncogenes and tumor suppressors in five human cancers.
The first step in the TRIAGE methodology is to identify “expression footprints” (ie, regions that are either induced or repressed at the RNA level and are therefore referred to as “induced” or “repressed” expression footprints, respectively). These areas, which are identified using a novel measure called a Local Singular Value Decomposition (LSVD) score, may overlap at the level of DNA with other genomic events including copy number alterations, mutations or epigenomic changes and may contain driver genes. TRIAGE then uses dual survival analyses to distinguish driver genes from passenger genes located in the same expression footprint. The first survival analysis identifies the genes that are significantly associated with the time-to-event outcome (eg, time to local or distant recurrence) by fitting a Cox proportional hazards model9 over all patients in the cohort. The second survival analysis identifies potential driver genes by testing associations with the time-to-event outcome in the samples that are not characterized by these expression footprints. TRIAGE represents several improvements over classical approaches to the analysis of differential expression. First, unlike single whole cohort survival analysis, the TRIAGE approach allows one to distinguish between driver and passenger genes. Second, it is sensitive enough to detect driver genes that are deregulated in a small subset of patients, whereas classical analyses are only able to detect genes that are commonly deregulated in most patients (as described in a number of detailed reviews10–13). Third, it is able to identify the samples and genes that contribute to the expression footprint. Furthermore, contrary to previous methods,14 which derive a measure of significance for each sample separately, TRIAGE analyzes the whole tumor cohort simultaneously. Finally, unlike other methods,15 it does not require samples from normal tissues. 
Here, we present the statistical properties of the LSVD score, which we characterized using simulated data. We then present the results of our analysis to identify potential candidate driver genes by applying TRIAGE to five human cancers and we discuss the resulting catalog.

Methods

The TRIAGE approach comprised three main steps, as outlined in Figure 6, and described in detail below:
Figure 6

Overview of the TRIAGE methodology.

The LSVD score is used to identify induced (or repressed) genomic expression footprints from gene expression data. Genomic regions containing a substantial proportion of genes that are either over- or under-expressed in multiple tumors contain potential oncogenes or tumor suppressor genes, respectively. The LSVD score that we used to perform the analyses presented here has been refined in this version of TRIAGE. Unselected survival analysis is used to identify associations between patient survival and gene expression profiles in the expression footprints. Gene expression profiles that are significantly associated with either increased or reduced risk of failure by Cox proportional hazards regression models are indicative of potential oncogenes or tumor suppressor genes, respectively. Selected survival analysis is used to distinguish driver genes from passenger genes in the expression footprints. While the expression of passenger genes may be associated with survival, driver genes are expected to be associated with survival even in samples where the respective expression footprint is absent; this is not the case for passenger genes. This expectation is based on the assumption that driver gene expression in tumors will often be deregulated by mechanisms other than copy number alteration or other regional events. The genes that are significantly associated with survival in both the unselected and selected survival analyses are interpreted as potential driver genes. These three steps are further detailed below.

Using LSVD score to identify induced (or repressed) genomic expression footprints

The objective of this step is to identify regions (ie, expression footprints) of co-expressed (ie, co-induced or co-repressed) genes and the corresponding subgroups of tumors that share the same expression footprints. The problem can be posed as the analysis of an undirected bipartite graph between the set of tumor samples and the set of genes. An edge between a tumor sample and a gene is established if the gene is overexpressed (or repressed) in that particular sample, ie, if its expression is above (or below) a predefined threshold (denoted by a). Next, we identify dense subgraphs in which the connectivity between tumor samples and genes is high. The link structure is then analyzed using Singular Value Decomposition (SVD) constrained upon the localization of gene nodes in the genome as described below. In the following procedures, we consider the measurement of gene expression for a set of n tumors over m genes. For each chromosome c; c ∈ {1,…,K}, let $E_c$ denote the matrix of log2 of gene expression of dimensions $n \times m_c$ (with $\sum_c m_c = m$), where $m_c$ is the number of genes on chromosome c. The expression footprints are identified by analyzing $E_c$ using LSVD according to the following steps: (1) defining the bipartite graph structure by transforming E into the binary connectivity matrix Y; (2) deriving chromosome-localized matrices Y for each location “l”; (3) applying SVD and computing the connectivity or LSVD score Δ; and (4) identifying the regions of interest (ie, expression footprints). While the first step is slightly different for the analysis of induced and repressed expression footprints (as described below), steps 2, 3, and 4 remain the same.

Defining the gene–tumor bipartite graph

E is transformed into the binary connectivity matrices A (for induced expression footprints) and D (for repressed expression footprints) through the following discretization rules.

Induced expression footprints. For each tumor sample j and gene i, the expression data are transformed into A as follows:

$$A_{ij} = \begin{cases} 1 & \text{if } E_{ij} > v_i + a\,\sigma_i \\ 0 & \text{otherwise,} \end{cases}$$

where $v_i$ is the median of the expression of gene i across the tumor samples, a is the predefined threshold, and $\sigma_i$ is the corresponding adjusted median absolute deviation (MAD). Although not required by the method, if samples from normal tissues are available, the mean of the signal for gene i among normal tissue samples can alternatively be used for $v_i$.

Repressed expression footprints. For each tumor sample j and gene i, matrix E is transformed into D as

$$D_{ij} = \begin{cases} 1 & \text{if } E_{ij} < v_i - a\,\sigma_i \\ 0 & \text{otherwise.} \end{cases}$$

We then set Y = A to identify induced expression footprints and Y = D to identify repressed expression footprints.
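As an illustration of the discretization step, the following minimal sketch binarizes a small expression matrix. It assumes that a gene is flagged as induced (or repressed) in a tumor when its log2 expression lies more than a adjusted MADs above (or below) the per-gene median; the function names, the 1.4826 scaling constant for the adjusted MAD, and the toy data are illustrative assumptions, not part of the published TRIAGE implementation.

```python
import numpy as np

def adjusted_mad(x, axis=None):
    """Median absolute deviation scaled by 1.4826 (assumed definition of 'adjusted MAD')."""
    med = np.median(x, axis=axis, keepdims=True)
    return 1.4826 * np.median(np.abs(x - med), axis=axis)

def binarize_expression(E, a=1.5):
    """E: (n_tumors, n_genes) log2 expression. Returns binary matrices (A, D)."""
    v = np.median(E, axis=0)           # per-gene reference level across tumors
    sigma = adjusted_mad(E, axis=0)    # per-gene robust spread
    A = (E > v + a * sigma).astype(int)   # induced links
    D = (E < v - a * sigma).astype(int)   # repressed links
    return A, D

# Toy data: 5 tumors x 2 genes; the last tumor is strongly deregulated.
E = np.array([[0.0, 0.1], [0.2, -0.1], [-0.2, 0.0], [0.1, 0.2], [5.0, -6.0]])
A, D = binarize_expression(E, a=1.5)
```

In this toy example, the last tumor receives an induced link for the first gene and a repressed link for the second gene.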

Deriving localized matrices

To account for the localization of expression footprints in the genome, we derive localized connectivity matrices. Local matrices Y at location “l” are derived from Y using the genes located in a window around l on chromosome c, where ω is the user-set window size.

Performing SVD of localized matrices Y

SVD decomposes a matrix Y of dimensions n × m into a product of three matrices U, Σ, and V such that

$$Y = U \Sigma V^{T},$$

where U and V are of dimension n × n and m × m, respectively, and Σ is an n × m rectangular diagonal matrix.16 Σ contains the singular values in descending order (by convention). The columns of U are called the left singular vectors (ordered by importance or eigen weights), which form an orthonormal basis, ie, $u_i \cdot u_j = 1$ for i = j and $u_i \cdot u_j = 0$ otherwise. Similarly, the rows of $V^{T}$ contain the elements of the right singular vectors (ordered by importance) and form an orthonormal basis. The largest or principal singular value of Y summarizes the density of the network. Its value increases with the number of links but it does not allow one to distinguish between networks in which links are concentrated around a few genes and tumor samples and networks in which links are spread among different genes and/or tumor samples. To account for this observation, in the newer version of the LSVD score, the singular vectors associated with the principal singular value are also included in the definition of the score, as described in the next subsection.
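The following sketch illustrates how the principal singular value of a localized binary matrix responds to a dense block of links. It is only an illustration: the window is assumed to span the genes from l − ω to l + ω (clipped at the chromosome ends), and the function name and toy matrix are hypothetical.

```python
import numpy as np

def local_principal_singular_values(Y, omega):
    """Y: (n_tumors, n_genes) binary matrix with genes in genomic order.
    Returns the largest singular value of each window [l - omega, l + omega]."""
    n, m = Y.shape
    out = np.zeros(m)
    for l in range(m):
        lo, hi = max(0, l - omega), min(m, l + omega + 1)  # assumed window bounds
        out[l] = np.linalg.svd(Y[:, lo:hi], compute_uv=False)[0]
    return out

# Toy network: 3 of 6 tumors share a block of 4 adjacent deregulated genes.
Y = np.zeros((6, 12), dtype=int)
Y[:3, 4:8] = 1
scores = local_principal_singular_values(Y, omega=2)

# Full decomposition Y = U diag(S) Vt with orthonormal singular vectors.
U, S, Vt = np.linalg.svd(Y, full_matrices=False)
```

Windows that fully cover the block reach the largest value (the square root of the number of links in an all-ones block), while windows without links score zero.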

Identifying the regions of interest

As shown by Kleinberg,17 the discriminative ability of the principal singular value increases with the number of repeated multiplications of the matrix to be decomposed. In order to build the final score, SVD is thus applied to matrices P and Q, which are based on the repeated multiplication of the square matrices $Y Y^{T}$ and $Y^{T} Y$, respectively, where Y is the localized matrix. The SVD of these matrices yields U and V, which contain the singular vectors associated with the tumor samples and the genes, respectively. Then, an LSVD score Δ at l is obtained by weighting the principal singular value of P (which is also the principal singular value of Q) by the corresponding first r values (r = min(n,m)) of the ordered principal singular vectors of P and Q. In this score, the principal singular value is linked to the number of links between the tumor samples and the genes. The entries of the principal singular vectors, U[·,1] and V[·,1], are weights associated with the tumor samples and genes, respectively, and summarize the importance of the nodes in the network structure (ie, the number of links as well as the importance of the nodes to which they are connected; see Kleinberg17 for a detailed interpretation). The higher the LSVD score, the higher the confidence in the expression footprint around l, indicating that the genes at this location are contributing to the expression footprint in a subset of tumor samples. Finally, the genomic regions with consecutive LSVD scores above the predetermined threshold represent the expression footprints. The weights U[·,1] and V[·,1] are used to identify the tumor samples and genes around location l that contribute to these footprints, as their weights are different from zero. Relative to the previous version of TRIAGE, the incorporation of principal singular vectors in this step in the current version is a major refinement.
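The effect of repeated multiplication can be checked numerically. The sketch below is not the LSVD formula itself (the exact definitions of P, Q, and Δ are not reproduced here); it only assumes that P and Q are integer powers k of the n × n and m × m products of the localized matrix with its transpose, and k is a hypothetical parameter. It shows that P and Q share their principal singular value, that the gap to the second singular value widens with k, and that the principal singular vector is nonzero only for the tumors in the dense block.

```python
import numpy as np

# Toy localized matrix: a dense 3 x 4 block plus a few scattered links.
Y = np.zeros((6, 8))
Y[:3, :4] = 1.0
Y[4, 6] = Y[5, 6] = Y[5, 7] = 1.0

def gap_after_powers(Y, k):
    """Principal singular values of P and Q, and the ratio s1/s2 for P."""
    P = np.linalg.matrix_power(Y @ Y.T, k)   # tumor-side matrix (n x n)
    Q = np.linalg.matrix_power(Y.T @ Y, k)   # gene-side matrix (m x m)
    sP = np.linalg.svd(P, compute_uv=False)
    sQ = np.linalg.svd(Q, compute_uv=False)
    return sP[0], sQ[0], sP[0] / sP[1]

sP1, sQ1, gap1 = gap_after_powers(Y, 1)
sP2, sQ2, gap2 = gap_after_powers(Y, 2)

# Tumor weights: absolute entries of the principal singular vector of P.
U, _, _ = np.linalg.svd(np.linalg.matrix_power(Y @ Y.T, 2))
tumor_weights = np.abs(U[:, 0])
```

Here the ratio between the first and second singular values is squared when k goes from 1 to 2, and only the first three tumors receive nonzero weights.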

Dual Survival Analysis

Dual survival analysis is used to distinguish between driver and passenger genes. Let $E_{ij}$ denote the log2 of expression of gene i; i ∈ {1,…,m} in tumor sample j; j ∈ {1,…,n}. Let $T_j$ be the possibly censored survival time for each tumor sample j.

Unselected survival analysis. First, for each gene i in a selected expression footprint, a Cox proportional hazards (Cox-PH) model9 is fit. The Cox-PH model is defined by the following hazards function

$$h(t \mid E_{ij}) = h_0(t)\,\exp(\beta_i E_{ij}), \qquad (1)$$

where $h_0(t)$ is an unknown baseline hazards function and $\beta_i$ is a parameter to be estimated. The model can also be expressed in terms of the survival function at time t as

$$S(t \mid E_{ij}) = S_0(t)^{\exp(\beta_i E_{ij})},$$

where $S_0(t)$ is the baseline survival function. The score statistic and associated P-value are then used to assess the significance of the association.

Selected survival analysis. Since passenger genes located in the expression footprints may also be associated with the survival outcome, a so-called selected survival analysis is conducted to reevaluate the association between survival and gene expression in the absence of the expression footprint. Driver genes are indeed assumed to influence the survival outcome even in the tumor samples that do not have the expression footprint.7 For this reason, model (1) is applied to the tumor samples that do not contribute to the footprint (ie, the tumor samples with weights equal to 0 for the expression footprint under consideration). The score statistic and associated P-value are used to assess the significance of the survival association. Finally, genes that are significantly associated with survival in both the unselected and selected survival analyses are interpreted as candidate driver genes.
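The dual analysis can be sketched with a plain NumPy implementation of the score test for a single-covariate Cox-PH model. This is an illustrative sketch, not the authors' code: the Breslow treatment of tied event times, the function names, the 0.05 cutoff argument, and the toy data are assumptions.

```python
import math
import numpy as np

def cox_score_test(time, event, x):
    """Score test of beta = 0 in a one-covariate Cox-PH model (Breslow ties).
    Returns (chi-square statistic with 1 df, P-value)."""
    time, event, x = np.asarray(time), np.asarray(event), np.asarray(x, dtype=float)
    score, info = 0.0, 0.0
    for t in np.unique(time[event == 1]):
        at_risk = x[time >= t]                    # covariate values in the risk set
        d = x[(time == t) & (event == 1)]         # covariates of subjects failing at t
        score += d.sum() - d.size * at_risk.mean()
        info += d.size * at_risk.var()
    stat = score * score / info
    return stat, math.erfc(math.sqrt(stat / 2.0))  # chi-square(1) upper tail

def dual_survival(time, event, x, weights, alpha=0.05):
    """Unselected test on all tumors, selected test on tumors with zero footprint weight."""
    time, event, x, weights = map(np.asarray, (time, event, x, weights))
    _, p_all = cox_score_test(time, event, x)
    keep = weights == 0
    _, p_sel = cox_score_test(time[keep], event[keep], x[keep])
    return p_all, p_sel, (p_all < alpha) and (p_sel < alpha)

# Toy cohort: higher expression goes with earlier failure; 3 tumors carry the footprint.
time = np.arange(1, 9)
event = np.ones(8, dtype=int)
x = np.arange(7, -1, -1.0)
weights = np.array([0.6, 0.5, 0.4, 0, 0, 0, 0, 0])
stat_all, _ = cox_score_test(time, event, x)
p_all, p_sel, is_candidate = dual_survival(time, event, x, weights)
```

In this toy cohort, the gene remains associated with survival after the footprint-carrying tumors are removed, so it would be retained as a candidate driver.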

Simulations to Study the Properties of the LSVD Score

Simulation scheme

We conducted a simulation study to evaluate the statistical properties of the LSVD score (Δ) when used to identify expression footprints. Induced and repressed expression footprints were considered separately. We simulated gene expression datasets composed of m = 1,000 genes profiled among n = 100 tumor samples. Gene expression values were simulated with a log-normal distribution Log-N(μ,σ) with expectation $e^{\mu + \sigma^2/2}$ and variance $e^{2\mu + \sigma^2}(e^{\sigma^2} - 1)$.18 Parameter σ was taken to be 1. The value of μ was equal to 0 for genes that did not belong to the expression footprint, was positive for genes in regions of induced expression footprint, and was negative for genes in regions of repressed expression footprint. A log2 transformation was then applied and the resulting expression was denoted as $X_{ij}$ for gene i; i ∈ {1,…,m} in tumor sample j; j ∈ {1,…,n}. As genes involved in the same or related pathway are likely to be co-expressed, we generated datasets with so-called “clumpy” dependence (ie, while gene measurements are dependent upon each other in small groups, measurements in each group are independent from the other groups) using the following procedures.19,20 For each group of 10 genes indexed by k; k = 1,…,10, a random vector R with one entry per tumor sample was generated from a standard normal distribution N(0,1). The data matrix E was then built from X and R, where ρ, the correlation between two groups of genes, was chosen to be 0, 0.25, 0.5 or 0.75. Finally, in order to evaluate the behavior of our LSVD score in approximating real genomic data analysis, we standardized the dataset using quantile normalization.21 A number of configurations were considered in order to study the influence of (i) the percentage of tumors contributing to the expression footprint, (ii) the number of genes forming the expression footprint, (iii) the mean value of the log-normal distribution μ, (iv) the window size ω, (v) the threshold parameter a used to define deregulated expression, and (vi) the correlation parameter ρ.
These configurations are summarized in Table 1. Additional details for each configuration are provided in Supplementary File 1. For each configuration, 200 repetitions of simulations were performed.
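A single simulated dataset with one induced footprint can be generated along the following lines. This is a minimal sketch under stated assumptions: the footprint location (genes 490 to 509), the random seed, and the function name are arbitrary illustrative choices, and the clumpy dependence and quantile normalization steps are omitted.

```python
import numpy as np

def simulate_expression(n=100, m=1000, footprint_genes=slice(490, 510),
                        frac_tumors=0.2, mu=3.0, sigma=1.0, seed=0):
    """Returns an (n_tumors, n_genes) matrix of log2 expression values."""
    rng = np.random.default_rng(seed)
    mean_log = np.zeros((n, m))                  # mu = 0 outside the footprint
    k = int(round(frac_tumors * n))              # tumors carrying the footprint
    mean_log[:k, footprint_genes] = mu           # mu > 0: induced footprint
    raw = rng.lognormal(mean=mean_log, sigma=sigma)
    return np.log2(raw)

X = simulate_expression()
footprint_mean = X[:20, 490:510].mean()
background_mean = X[20:, :].mean()
```

With μ = 3 and σ = 1, as in configuration (i), the log2 expression in the footprint block is centered near 3/ln 2 (about 4.3), whereas the background is centered near 0.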
Table 1

Parameter settings used in the six simulated configurations.

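| Configuration | % of tumors | Number of genes in the footprint | Mean value of gene expression (induced) | Mean value of gene expression (repressed) | Window size (ω) | Threshold (a) | Correlation |
|---|---|---|---|---|---|---|---|
| (i) | 5 to 80% | 20 | 3 | −3 | 5 | 1.5 | 0 |
| (ii) | 20% | 5 to 100 | 3 | −3 | 5 | 1.5 | 0 |
| (iii) | 20% | 20 | 1.5 to 4 | −4 to −1.5 | 5 | 1.5 | 0 |
| (iv) | 20% | 20 | 3 | −3 | 5 to 20 | 1.5 | 0 |
| (v) | 20% | 20 | 3 | −3 | 5 | 1 to 2 | 0 |
| (vi) | 20% | 20 | 3 | −3 | 5 | 1.5 | 0 to 0.75 |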

Simulation Results

Simulation results for configurations (i), (ii), and (iii) are presented in Figures 1, 2, and 3, respectively. Simulation results for configurations (iv), (v), and (vi) are shown in Figures S1, S2, and S3 in Supplementary File 2, respectively.
Figure 1

Graph (A) and boxplot (B) of the LSVD scores for 1,000 overexpressed, simulated genes contributing to the expression footprint for varying percentages of tumor samples (representing over 200 repetitions).

Figure 2

Graph (A) and boxplot (B) of the score value for 1,000 overexpressed simulated genes, for different expression footprint sizes over 200 repetitions.

Figure 3

Graph of the score value for 1,000 overexpressed (A) and underexpressed (B) simulated genes, for different expression levels and for 200 repetitions.

Figure 1 presents the variation of the LSVD score for different percentages of tumor samples contributing to the induced expression footprint. The LSVD score Δ allows one to detect the expression footprints that are shared by 5 to 30% of the samples. Simulations for repressed expression footprints yielded similar results (data not shown). Results for expression footprints of varying sizes (Fig. 2) show that the value of Δ is saturated for expression footprints >10 genes because of the window size of ω = 5 genes. The value of Δ is smaller for expression footprints of 10 or fewer genes. We obtained similar results for repressed expression footprints (data not shown). The influence of the mean expression change is depicted in Figure 3A for induced expression footprints and in Figure 3B for repressed expression footprints. The score increases with the absolute value of the gene effect: the greater the absolute mean value of the log-normal distribution, the higher the score. Figure S1 in Supplementary File 2 presents the results of simulations similar to configuration (ii) for different window sizes (ω = 5; 10; 20). The LSVD score is lower for smaller window sizes while its variance is larger. Figure S2 in Supplementary File 2 shows the results of simulations similar to configuration (ii) for different threshold parameters (a = 1; 1.5; and 2), indicating that the value of Δ is robust to the variation of a. Finally, the influence of the correlation parameter on Δ was determined for simulation configuration (vi) for ρ = 0; 0.25; 0.5; 0.75. Figure S3 in Supplementary File 2 shows that the higher the correlation, the smaller the value of Δ, as the addition of the correlation introduces noise in the dataset. Even with this noise, expression footprints are still detectable. Simulations using different sample sizes (n = 50; 100; 200; 500) yielded similar results (data not shown). Thus, the sample size did not affect the value of Δ.

Deriving Driver Gene Catalogs in Five Cancers

In this section, we present the results obtained when we used TRIAGE to identify candidate driver genes that are deregulated in subpopulations of tumors. We used five large datasets representing breast, ovarian, lung, and colon cancers and glioma. The datasets that we used are summarized in Table 2; sample sizes varied from 111 to 741 patient tumors. Gene expression was measured using Affymetrix HU133A, HU133B, and HU133Plus2.0 arrays (Affymetrix, Santa Clara, CA, USA). We used na32 annotation files obtained from Affymetrix (http://www.affymetrix.com). Raw data were normalized using quantile normalization.21 We averaged the measurements of transcripts that corresponded to the same gene on a chromosome. Different types of survival outcomes were available in different datasets, defined as follows. Overall survival (OS) was defined as the time from inclusion of the respective patient in the study (eg, surgery) until death or last follow-up. Relapse-free survival (RFS) was defined as the time from inclusion until disease-related death, disease recurrence (either local or distant), or last follow-up. Distant metastasis-free survival (DMFS) was defined as the time from inclusion to the first distant recurrence event or to last follow-up.
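For readers unfamiliar with quantile normalization, the following sketch shows the basic idea on a toy matrix: every sample is forced to share the same empirical distribution, namely the mean of the sorted values across samples. It is a simplified illustration (ties are broken arbitrarily, and the genes-by-samples orientation and function name are assumptions), not the preprocessing pipeline used for the datasets analyzed here.

```python
import numpy as np

def quantile_normalize(X):
    """X: (n_genes, n_samples). Returns X with identical value distributions per column."""
    order = np.argsort(X, axis=0)
    ranks = np.argsort(order, axis=0)              # rank of each value within its column
    reference = np.sort(X, axis=0).mean(axis=1)    # mean of the k-th smallest values
    return reference[ranks]

# Toy matrix: 4 genes x 3 samples, no ties within a column.
X = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 3.0, 6.0],
              [4.0, 2.0, 8.0]])
Xn = quantile_normalize(X)
```

After normalization, each column contains the same set of values and the within-column ranking of the genes is preserved.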
Table 2

Description of the different cancer studies. The two rightmost columns give the number of putative driver genes identified by TRIAGE.

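| GEO ID | Cancer type | Platform | Sample size | No. with survival data | Survival outcome | Ref. | No. of potential oncogenes | No. of potential tumor suppressors |
|---|---|---|---|---|---|---|---|---|
| GSE9891 | Ovarian | HU133Plus2.0 | 295 | 220 | RFS | [29] | 171 | 227 |
| GSE16011 | Glioma | HU133Plus2.0 | 284 | 266 | OS | [30] | 985 | 784 |
| GSE17538 | Colon | HU133Plus2.0 | 232 | 232 | RFS | [31] | 71 | 63 |
| GSE3141 | Lung | HU133Plus2.0 | 111 | 111 | OS | [32] | 46 | 26 |
| Combined study* | Breast | HU133A+B | 741 | 624 | DMFS | [33] | 445 | 118 |

Notes: *Combined study of GSE3494, GSE1456, GSE6532, and GSE4922.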

For each dataset, Δ was calculated for each chromosomal arm with a sliding window of size ω = 5. Induced and repressed expression footprints were identified separately. A threshold corresponding to median(Δ) ± 2 mad(Δ), with “mad” representing the adjusted MAD, was chosen to identify relevant expression footprints. Regions with an LSVD score exceeding this threshold were selected and extended by ω on either side for dual survival analyses. The extended regions were considered to be expression footprints. Unselected and selected survival analyses were performed using the genes within the expression footprints. Associations between gene expression and “poor” prognosis were obtained for the genes located within induced expression footprints in order to identify potential oncogenes. Associations between gene expression and “good” prognosis were obtained for those within repressed expression footprints in order to identify potential tumor suppressor genes. A threshold of P = 0.05 was used to indicate statistical significance for both sets of survival analyses. Circular plots22 (see Supplementary File 3) provide an overview of these results, including the value of Δ for each chromosome and the location of potential oncogenes and tumor suppressors along the genome. For instance, the plot in Figure S5 in Supplementary File 3 (from the breast cancer study) indicates that the highest values of Δ were located on chromosome 17q for the induced expression footprint and on chromosome 7p for the repressed expression footprint. The largest sets of potential oncogenes were observed on chromosomes 17q, 19, and 8. Potential tumor suppressors were located throughout the genome. Supplementary Files 4 and 5 provide lists of putative oncogenes and tumor suppressors selected by TRIAGE for the different cancers studied.
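The thresholding and extension step can be sketched as follows. The sketch assumes that only scores above the upper bound median(Δ) + 2 mad(Δ) are retained, that the adjusted MAD uses the usual 1.4826 scaling, and that regions are clipped at the ends of the chromosomal arm; the function name and the toy scores are hypothetical.

```python
import numpy as np

def find_footprints(delta, omega=5, n_mad=2.0):
    """Return (start, end) index pairs (inclusive) of runs of scores above
    median + n_mad * adjusted MAD, each extended by omega on either side."""
    delta = np.asarray(delta, dtype=float)
    med = np.median(delta)
    mad = 1.4826 * np.median(np.abs(delta - med))
    above = delta > med + n_mad * mad
    last = len(delta) - 1
    regions, start = [], None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i                                   # a run of high scores begins
        elif not flag and start is not None:
            regions.append((max(0, start - omega), min(last, i - 1 + omega)))
            start = None
    if start is not None:                               # run reaches the end of the arm
        regions.append((max(0, start - omega), last))
    return regions

# Toy scores along one chromosomal arm with a single peak at positions 10 to 12.
delta = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 1.0,
         6.0, 7.0, 6.5, 1.0, 0.9, 1.1, 1.0, 1.2, 0.8, 1.0]
footprints = find_footprints(delta, omega=2)
```

With ω = 2, the peak at positions 10 to 12 yields one footprint spanning positions 8 to 14.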
A pathway analysis on the selected genes (1638 oncogenes and 1196 tumor suppressors) performed using Ingenuity Pathway Analysis (Ingenuity Systems, www.ingenuity.com; see Supplementary File 6) shows significant enrichment in cancer-annotated genes, most specifically in carcinoma, in solid tumor, and in several other types of tumors and cancers. A total of 786 genes were classified under this category. Other pathways commonly observed in cancers, including apoptosis, cell death, cell growth and proliferation, and tumor morphology, were also significantly enriched. From the lists of potential oncogenes and tumor suppressors, we intersected those common to the different cancer types (see Fig. 4 for summary statistics). The number of genes shared among different cancer types was relatively small compared to the total number of genes identified for each individual cancer, suggesting that driver genes are largely specific to a given cancer type. The genes most frequently shared across cancer types are listed in Table 3; the top four genes from this list are present in at least three different cancers. Among them, MMP1 belongs to the matrix metalloproteinase (MMP) family, which is known to play a role in metastasis as up-regulation of MMPs leads to enhanced cancer cell invasion.23 IL8 is an important mediator of the inflammatory response and is implicated in various cancer types.24,25 COL1A2 encodes one of the chains for type I collagen and is methylated in multiple cancer cell lines.26,27 It is also involved in a fusion with gene PLAG1 in lipoblastoma, a benign infant tumor.28
Figure 4

Number of genes in common among different cancer types.
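Counting how many cancer types share each selected gene amounts to a simple tally over the per-cancer gene lists. The sketch below uses a hypothetical helper and a toy input restricted to the four most frequently shared oncogenes, with memberships taken from Table 3.

```python
from collections import Counter

def count_shared_genes(selected, min_studies=2):
    """selected: dict mapping cancer type -> set of selected gene symbols.
    Returns {gene: number of cancer types} for genes found in >= min_studies."""
    counts = Counter(g for genes in selected.values() for g in set(genes))
    return {g: c for g, c in counts.items() if c >= min_studies}

# Toy input restricted to four oncogenes (memberships as in Table 3).
selected = {
    "colon": {"IL8", "ASPN", "MMP1"},
    "glioma": {"IL8", "COL1A2", "ASPN", "MMP1"},
    "breast": {"IL8", "MMP1"},
    "lung": {"IL8", "COL1A2"},
    "ovarian": {"COL1A2", "ASPN"},
}
shared = count_shared_genes(selected)
```

The resulting counts correspond to the TOTAL column of Table 3 for these four genes.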

Table 3

Genes common to the five cancer studies.

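| Chr. | Gene symbol | Gene name | GSE17536 Colon | GSE16011 Glioma | GSE3494 Breast | GSE3141 Lung | GSE9891 Ovarian | Total |
|---|---|---|---|---|---|---|---|---|
| Oncogenes | | | | | | | | |
| 4q | IL8 | Interleukin 8 | 1* | 1* | 1* | 1* | 0* | 4 |
| 7q | COL1A2 | Collagen, type I, alpha 2 | 0* | 1 | 0 | 1 | 1 | 3 |
| 9q | ASPN | Asporin | 1 | 1 | 0 | 0 | 1 | 3 |
| 11q | MMP1 | Matrix metallopeptidase 1 (interstitial collagenase) | 1* | 1* | 1* | 0* | 0* | 3 |
| 1q | RPTN | Repetin | 0 | 1 | 0 | 1 | 0 | 2 |
| 1q | S100A2 | S100 calcium binding protein A2 | 1* | 1 | 0* | 0* | 0 | 2 |
| 3q | PLSCR4 | Phospholipid scramblase 4 | 0 | 1 | 0 | 0 | 1 | 2 |
| 3q | TM4SF1 | Transmembrane 4 L six family member 1 | 1* | 1 | 0* | 0* | 0* | 2 |
| 4q | LOC255130 | Uncharacterized LOC255130 | 0 | 1 | 0 | 1 | 0 | 2 |
| 4q | CXCL6 | Chemokine (C-X-C motif) ligand 6 (granulocyte chemotactic protein 2) | 0* | 1 | 0* | 1* | 0 | 2 |
| 4q | CXCL3 | Chemokine (C-X-C motif) ligand 3 | 0* | 1 | 0 | 1 | 0 | 2 |
| 4q | AREG | Amphiregulin (schwannoma-derived growth factor) | 1* | 1* | 0* | 0* | 0* | 2 |
| 4q | C4orf46 | Chromosome 4 open reading frame 46 | 0 | 1 | 1 | 0 | 0 | 2 |
| 5q | FST | Follistatin | 0 | 1 | 0 | 0* | 1* | 2 |
| 5q | GPX8 | Glutathione peroxidase 8 (putative) | 0 | 1 | 0 | 0 | 1 | 2 |
| 5q | C5orf46 | Chromosome 5 open reading frame 46 | 1 | 1 | 0 | 0 | 0 | 2 |
| 6p | F13A1 | Coagulation factor XIII, A1 polypeptide | 0 | 1 | 0 | 0 | 1 | 2 |
| 6p | LY86 | Lymphocyte antigen 86 | 0 | 1 | 0 | 0 | 1 | 2 |
| 6p | HIST1H1D | Histone cluster 1, H1d | 0 | 1 | 1 | 0 | 0 | 2 |
| 6p | HIST1H2AL | Histone cluster 1, H2al | 0 | 1 | 0 | 1 | 0 | 2 |
| 6q | EYA4 | Eyes absent homolog 4 (Drosophila) | 0 | 1 | 0 | 1 | 0 | 2 |
| 7p | SKAP2 | Src kinase associated phosphoprotein 2 | 0 | 1 | 0 | 0 | 1 | 2 |
| 7p | HOXA3 | Homeobox A3 | 0 | 1 | 0 | 0 | 1 | 2 |
| 7p | HOXA-AS2 | HOXA cluster antisense RNA 2 (non-protein coding) | 0 | 1 | 0 | 0 | 1 | 2 |
| 7p | HOXA4 | Homeobox A4 | 0 | 1 | 0 | 0 | 1* | 2 |
| 7p | HOXA5 | Homeobox A5 | 0 | 1 | 0* | 0* | 1 | 2 |
| 7p | TAX1BP1 | Tax1 (human T-cell leukemia virus type I) binding protein 1 | 0 | 1 | 0 | 1 | 0 | 2 |
| 7q | HGF | Hepatocyte growth factor (hepapoietin A; scatter factor) | 0* | 1* | 0* | 0* | 1* | 2 |
| 7q | GNG11 | Guanine nucleotide binding protein (G protein), gamma 11 | 1 | 1 | 0 | 0 | 0 | 2 |
| 8p | DLC1 | Deleted in liver cancer 1 | 0* | 1 | 0* | 0* | 1 | 2 |
| 8q | ANGPT1 | Angiopoietin 1 | 0* | 1* | 0* | 0* | 1* | 2 |
| 8q | GPR172A | G protein-coupled receptor 172A | 0 | 1 | 1 | 0 | 0 | 2 |
| 9q | ECM2 | Extracellular matrix protein 2, female organ and adipocyte specific | 0 | 1 | 0 | 0 | 1 | 2 |
| 10q | ZWINT | ZW10 interactor | 0 | 1 | 1 | 0 | 0 | 2 |
| 10q | ACSL5 | Acyl-CoA synthetase long-chain family member 5 | 0* | 1* | 0 | 0 | 1 | 2 |
| 11p | HRAS | V-Ha-ras Harvey rat sarcoma viral oncogene homolog | 0* | 1* | 1* | 0* | 0* | 2 |
| 11p | FIBIN | Fin bud initiation factor | 0 | 1 | 0 | 0 | 1 | 2 |
| 11p | LGR4 | Leucine-rich repeat-containing G protein-coupled receptor 4 | 0 | 0 | 1 | 0 | 1 | 2 |
| 11q | CFL1 | Cofilin 1 (non-muscle) | 0* | 0* | 1* | 1* | 0* | 2 |
| 11q | PPP1CA | Protein phosphatase 1, catalytic subunit, alpha isoform | 0 | 1 | 1* | 0 | 0 | 2 |
| 11q | PDGFD | Platelet derived growth factor D | 0 | 1* | 0 | 0* | 1* | 2 |
| 11q | CASP4 | Caspase 4, apoptosis-related cysteine peptidase | 0 | 1 | 0* | 0* | 1* | 2 |
| 12p | NDUFA9 | NADH dehydrogenase (ubiquinone) 1 alpha subcomplex, 9, 39 kDa | 0 | 1 | 1 | 0 | 0 | 2 |
| 12p | OLR1 | Oxidized low density lipoprotein (lectin-like) receptor 1 | 0 | 1 | 0* | 0 | 1 | 2 |
| 12p | GABARAPL1 | GABA(A) receptor-associated protein like 1 | 1 | 0 | 0* | 0 | 1 | 2 |
| 12q | HOXC13 | Homeobox C13 | 0 | 1 | 1 | 0 | 0 | 2 |
| 12q | HOTAIR | Hox transcript antisense intergenic RNA | 0* | 1 | 1* | 0 | 0 | 2 |
| 12q | RAP1B | RAP1B, member of RAS oncogene family | 0 | 1* | 1 | 0 | 0 | 2 |
| 12q | ATP5H | ATP synthase, H+ transporting, mitochondrial f0 complex, subunit d | 0 | 1 | 1 | 0 | 0 | 2 |
| 12q | NUP107 | Nucleoporin 107 kDa | 0 | 1 | 1 | 0 | 0 | 2 |
| 12q | MDM2 | Mdm2, transformed 3T3 cell double minute 2, p53 binding protein (mouse) | 0* | 1* | 1* | 0* | 0* | 2 |
| 12q | LUM | Lumican | 1* | 0 | 0* | 0* | 1 | 2 |
| 12q | DCN | Decorin | 1* | 0* | 0* | 0* | 1* | 2 |
| 13q | MIR1244-1 | MicroRNA 1244-1 | 0 | 1 | 1 | 0 | 0 | 2 |
| 13q | PTMA | Prothymosin, alpha (gene sequence 28) | 0* | 1 | 1 | 0* | 0 | 2 |
| 13q | LOC441454 | Prothymosin, alpha pseudogene | 0 | 1 | 1 | 0 | 0 | 2 |
| 14q | SNORD114-3 | Small nucleolar RNA, C/D box 114-3 | 1 | 0 | 0 | 0 | 1 | 2 |
| 16p | C16orf59 | Chromosome 16 open reading frame 59 | 0 | 0 | 1 | 1 | 0 | 2 |
| 16q | GOT2 | Glutamic-oxaloacetic transaminase 2, mitochondrial (aspartate aminotransferase 2) | 0 | 1 | 1 | 0 | 0 | 2 |
| 16q | CKLF | Chemokine-like factor | 0 | 1 | 1 | 0 | 0 | 2 |
| 17q | BRIP1 | BRCA1 interacting protein C-terminal helicase 1 | 0* | 1 | 1* | 0 | 0* | 2 |
| 17q | ABCA8 | ATP-binding cassette, sub-family A (ABC1), member 8 | 0 | 1 | 0 | 0 | 1 | 2 |
| 18q | ALPK2 | Alpha-kinase 2 | 0 | 1 | 0 | 0 | 1 | 2 |
| 18q | DSEL | Dermatan sulfate epimerase-like | 0 | 1 | 0 | 0 | 1 | 2 |
| 19q | C19orf48 | Chromosome 19 open reading frame 48 | 0 | 1 | 1 | 0 | 0 | 2 |
| 20q | RPN2 | Ribophorin II | 0 | 1 | 1* | 0 | 0 | 2 |
| 20q | TTI1 | TELO2 interacting protein 1 | 0 | 1 | 1 | 0 | 0 | 2 |
| 22q | POM121L9P | POM121 membrane glycoprotein-like 9, pseudogene | 0 | 1 | 0 | 0 | 1 | 2 |
| Tumor suppressor genes | | | | | | | | |
| 1q | TXNIP | Thioredoxin interacting protein | 0* | 1 | 1* | 0 | 0 | 2 |
| 1q | TTC13 | Tetratricopeptide repeat domain 13 | 0 | 1 | 0 | 0 | 1 | 2 |
| 2q | MRPL30 | Mitochondrial ribosomal protein L30 | 0 | 1 | 0 | 0 | 1 | 2 |
| 4p | CCDC96 | Coiled-coil domain containing 96 | 1 | 0 | 0 | 0 | 1 | 2 |
| 5q | DMXL1 | Dmx-like 1 | 0 | 1 | 0 | 1 | 0 | 2 |
| 6p | PPP2R5D | Protein phosphatase 2, regulatory subunit B’, delta isoform | 0 | 1 | 0 | 0 | 1 | 2 |
| 8q | ZNF704 | Zinc finger protein 704 | 1 | 1 | 0 | 0 | 0 | 2 |
| 8q | PAG1 | Phosphoprotein associated with glycosphingolipid microdomains 1 | 1 | 1 | 0 | 0* | 0 | 2 |
| 10q | ANK3 | Ankyrin 3, node of Ranvier (ankyrin G) | 0 | 1 | 0 | 1 | 0 | 2 |
| 11q | PITPNM1 | Phosphatidylinositol transfer protein, membrane-associated 1 | 1 | 0 | 0 | 0 | 1* | 2 |
| 12q | SET | SET translocation (myeloid leukemia-associated) | 0* | 1 | 0 | 0 | 1 | 2 |
| 17p | LOC284014 | Uncharacterized LOC284014 | 0 | 1 | 0 | 0 | 1 | 2 |
| 17p | ZFP3 | Zinc finger protein 3 homolog (mouse) | 0 | 1 | 0 | 0 | 1 | 2 |
| 17p | C17orf81 | Chromosome 17 open reading frame 81 | 0 | 1 | 0 | 0 | 1 | 2 |
| 17p | CYB5D1 | Cytochrome b5 domain containing 1 | 1 | 0 | 0 | 0 | 1 | 2 |
| 18q | TCF4 | Transcription factor 4 | 0* | 1* | 1* | 0* | 0 | 2 |
| 19p | CFD | Complement factor D (adipsin) | 0 | 0 | 1 | 1 | 0 | 2 |
| 21p | LOC100132288 | NA | 0 | 1 | 1 | 0 | 0 | 2 |
| 21p | LOC389834 | Hypothetical gene supported by AK123403 | 0 | 1 | 1 | 0 | 0 | 2 |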

Notes: For each study, 1 indicates that the gene was selected by the TRIAGE methodology and 0 indicates otherwise. A star (*) indicates that the gene has been shown to be associated with the disease (according to Genecards, www.genecards.org). The last column of the table presents the number of studies in which the gene was identified as a potential driver. Details on hazard ratios and P-values are provided in Additional Files 4 and 5.

Figure 5 displays the centered and normalized LSVD score for the different windows containing MMP1, IL8, and COL1A2 (0 corresponds to the window centered on the considered gene). Stars indicate studies where the gene was significantly associated with both the unselected and selected survival. For most studies, the LSVD score for MMP1 was high for all window sizes, suggesting that this gene belongs to a larger region of deregulation. Similarly, IL8 is deregulated as part of a large expression footprint. The deregulation of MMP1 and IL8 is strongly associated with a dosage effect. For COL1A2, high LSVD scores are more localized, indicating that, although this gene does not belong to a large region of deregulation, it might be activated by other mechanisms, such as methylation or fusion.
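As a rough illustration of how a window-based profile of this kind can be computed around a gene, the sketch below scores every window of a given size that contains the gene and then centers and normalizes the resulting profile. It is not the LSVD score itself: only the principal singular value of each local matrix is used, whereas the actual score also draws on the two principal singular vectors. The function names, the requirement of an odd window size, and the normalization across offsets are assumptions made for illustration.

```python
import numpy as np


def principal_singular_value(local: np.ndarray) -> float:
    """Largest singular value of a local (genes x tumors) expression matrix."""
    return float(np.linalg.svd(local, compute_uv=False)[0])


def window_profile(expr: np.ndarray, gene: int, size: int) -> dict[int, float]:
    """Score every window of `size` consecutive genes that contains `gene`.

    expr: genes x tumors matrix, rows ordered by genomic position.
    Returns {offset: score}; offset 0 is the window centered on `gene`.
    """
    if size % 2 == 0:
        raise ValueError("size must be odd so that a window has a center")
    half = size // 2
    scores: dict[int, float] = {}
    for offset in range(-half, half + 1):
        start = gene + offset - half
        if start < 0 or start + size > expr.shape[0]:
            continue  # window would run past the end of the gene list
        scores[offset] = principal_singular_value(expr[start:start + size])
    return scores


def center_and_normalize(scores: dict[int, float]) -> dict[int, float]:
    """Subtract the mean and divide by the standard deviation across offsets."""
    values = np.array(list(scores.values()))
    mean, spread = values.mean(), values.std()
    if spread == 0:
        return {offset: 0.0 for offset in scores}
    return {offset: float((v - mean) / spread) for offset, v in scores.items()}


if __name__ == "__main__":
    # Hypothetical data: 40 genes x 30 tumors with a simulated footprint
    # (genes 15-24 elevated in the first 10 tumors).
    rng = np.random.default_rng(0)
    expr = rng.normal(scale=0.1, size=(40, 30))
    expr[15:25, :10] += 2.0
    for size in (3, 5, 7):
        print(size, center_and_normalize(window_profile(expr, gene=20, size=size)))
```

A gene sitting inside a broad footprint would be expected to show high values at every offset and window size, while a locally deregulated gene would show high values only near offset 0.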
Figure 5

Heatmap representation of the LSVD score for the windows centered on (A) MMP1, (B) IL8, (C) COL1A2.

Note: A star * indicates if the gene was shown to be associated with the cancer (according to Genecards, www.genecards.org).

Discussion

In this paper, we presented refinements of the TRIAGE method, an approach that we developed to identify potential driver genes. We first characterized the LSVD score using simulation. We next identified known and novel driver genes in cancer using gene expression data, genomic information, and survival data.

TRIAGE uses two main steps. First, Δ is derived using an LSVD score to identify regions of deregulated expression, called expression footprints. Then, two survival analyses are performed on the genes located within these selected loci. The score derived in the first step represents the linkage structure between the set of genes and the set of tumor samples. This score is obtained using three factors: the principal singular value, which quantifies the number of links between the tumor samples and the genes, and the two ordered principal singular vectors of the LSVD, which together summarize the connectivity of the network. Calculating this score using local matrices allows us to take into account the location of expression footprints throughout the genome. Indeed, genes located in the same region are more likely to influence each other or to be influenced by chromosomal or epigenetic events.

In the second step, dual survival analyses are used to distinguish driver genes from passenger genes. First, an unselected survival analysis identifies the genes that are significantly associated with time-to-event outcomes. A selected survival analysis is then conducted after excluding the tumor samples that contribute to the expression footprint. Driver genes are presumed to have an impact on survival even in the absence of the corresponding expression footprint, whereas passenger genes are selected only because they are co-located with a driver gene and thus belong to the expression footprint.
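The logic of the dual survival analysis can be sketched as follows. This is a simplified illustration rather than the TRIAGE implementation: it swaps in a two-group log-rank test on a median split of expression, and it ignores the weighting of tumors by their LSVD weight. The function names, the median split, the boolean `in_footprint` flags, and the 0.05 threshold are all assumptions.

```python
import math
from statistics import median


def logrank_p(times, events, groups):
    """Two-sided log-rank p-value comparing tumors with groups[i] True vs False.

    times: follow-up times; events: 1 if the event was observed, 0 if censored.
    """
    obs_minus_exp = 0.0
    variance = 0.0
    for t in sorted({u for u, e in zip(times, events) if e}):
        # Tumors still at risk at time t, as (had event at t, in group 1) pairs.
        risk = [(bool(e) and u == t, bool(g))
                for u, e, g in zip(times, events, groups) if u >= t]
        n = len(risk)
        n1 = sum(g for _, g in risk)
        d = sum(died for died, _ in risk)
        d1 = sum(died and g for died, g in risk)
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0
    chi2 = obs_minus_exp ** 2 / variance
    return math.erfc(math.sqrt(chi2 / 2))  # chi-square tail, 1 degree of freedom


def survival_p(expression, times, events):
    """P-value for high vs low expression (assumed median split)."""
    cutoff = median(expression)
    return logrank_p(times, events, [x > cutoff for x in expression])


def is_putative_driver(expression, times, events, in_footprint, alpha=0.05):
    """True if the gene is significant in both survival analyses."""
    # Unselected analysis: all tumors.
    unselected = survival_p(expression, times, events)
    # Selected analysis: drop tumors flagged as contributing to the footprint.
    keep = [i for i, flagged in enumerate(in_footprint) if not flagged]
    if not keep:
        return False
    selected = survival_p([expression[i] for i in keep],
                          [times[i] for i in keep],
                          [events[i] for i in keep])
    return unselected < alpha and selected < alpha


if __name__ == "__main__":
    # Hypothetical gene whose expression tracks survival in every tumor.
    expression = list(range(20))
    times = [40 - 2 * i for i in range(20)]
    events = [1] * 20
    in_footprint = [i >= 15 for i in range(20)]
    print(is_putative_driver(expression, times, events, in_footprint))
```

Under this sketch, a passenger gene is one whose association with survival disappears once the footprint tumors are removed, so it passes the unselected test but fails the selected one.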
Potential driver genes are those that are significantly associated with survival in both the unselected and selected survival analyses.

Our simulation results illustrated that the value of Δ increases with both the size of the expression footprint and the relative risk. Δ is robust to the choice of threshold parameter (ie, within a range of 1–2) but is affected by the size of the window, although this is not a problem if the same window size is used throughout the analysis. Indeed, the score was only saturated in larger expression footprints, which were few in number.

Using real datasets derived from five different cancer types, we illustrated that TRIAGE was able to identify potential driver genes that were enriched for biological processes known to be involved in cancer progression. Among the selected genes, known oncogenes such as MMP1, IL8, and COL1A2 were identified as drivers for multiple cancers. Many new potential driver genes were also identified, and further biological validation studies would be invaluable to confirm or disprove the importance of these genes in the etiology of cancer.

Our results illustrate that TRIAGE offers several advantages over traditional methods of expression analysis, which select genes that are commonly over- or underexpressed. In contrast, TRIAGE relies on patient heterogeneity to highlight different subtypes of gene expression. TRIAGE is thus a useful tool for identifying the genes that distinguish between subgroups of patients having the same disease but differing in their genomic profiles, including differences in active driver genes. These subpopulations could potentially benefit from different treatments. Such patient-specific approaches are central to the increasingly influential field of personalized medicine.

TRIAGE is not without limitations. Here, we focused on the analysis of gene expression.
However, the mechanisms that underlie cancer are tremendously complex, involving a host of other genomic aberrations including copy number variations, mutations, and fusions. We anticipate that future refinements to the TRIAGE approach will allow us to account for these influences. TRIAGE is also limited to the identification of driver genes harbored in regions associated with deregulated gene expression, whereas many genes become deregulated in isolation through other mechanisms. For example, p53 is deregulated by a deleterious mutation, but the expression of its genomic region is not deregulated. Similarly, fusion genes may be formed by translocations in the absence of tandem duplications. These limitations notwithstanding, TRIAGE is a valuable tool to identify driver genes that are associated with regions of deregulated gene expression in cancer and may perhaps be applicable to other vexing conditions as well.

Additional file 1. Detailed description of the simulations. Description: Additional details are given for the different configurations of simulations.

Additional file 2. Additional figures for the simulation results.

Additional file 3. Circular plots for the breast, ovarian, lung, and colon cancers and glioma datasets.

Additional file 4. List of putative oncogenes selected by the TRIAGE methodology for the five cancers. Description: Tables providing the list of genes identified for each dataset.

Additional file 5. List of putative tumor suppressors selected by the TRIAGE methodology for the five cancers. Description: Tables providing the list of genes identified for each dataset.

Additional file 6. Pathway analysis of the selected genes using Ingenuity Pathway Analysis.