Recurrent copy number alterations in young women with breast cancer.

Chen Chi1,2, Leigh C Murphy1,3, Pingzhao Hu1,2,4.   

Abstract

Breast cancer diagnosis in young women has emerged as an independent prognostic factor with higher recurrence risk and death than their older counterparts. We aim to find recurrent somatic copy number alteration (CNA) regions identified from breast cancer microarray data and associate the CNA status of the genes harbored in the regions to the survival of young women with breast cancer. By using the interval graph-based algorithm we developed, and the CNA data consisting of a Discovery set with 130 young women and a Validation set with 125 young women, we identified 38 validated recurrent CNAs containing 39 protein encoding genes. CNA gain regions encompassing genes CAPN2, CDC73 and ASB13 are the top 3 with the highest occurring frequencies in both the Discovery and Validation dataset, while gene SGCZ ranked top for the recurrent CNA loss regions. The mutation status of 9 of the 39 genes shows significant associations with breast cancer specific survival. Interestingly, the expression level of 2 of the 9 genes, ASB13 and SGCZ, shows significant association with survival outcome. Patients with CNA mutations in both of these genes had a worse survival outcome when compared to patients without the gene mutations. The mutated CNA status in gene ASB13 was associated with a higher gene expression, which predicted patient survival outcome. Together, identification of the CNA events with prognostic significance in young women with breast cancer may be used in genomic-guided treatment.

Keywords:  breast cancer; graph algorithm; recurrent copy number alterations; risk genes; young women

Year:  2018        PMID: 29545918      PMCID: PMC5837756          DOI: 10.18632/oncotarget.24336

Source DB:  PubMed          Journal:  Oncotarget        ISSN: 1949-2553


INTRODUCTION

Although young women account for only 7% of all breast cancer cases, breast cancer is the most common cancer among young females [1]. Moreover, young age at diagnosis of breast cancer has emerged as an independent factor for higher recurrence risk and death in various studies [2-6]. Breast tumours in young women have been described as more biologically aggressive (basal and HER2-enriched subtypes) than those in older women, which has been associated with a poorer prognosis [6]. Several factors contribute to the poor prognosis in the young subgroup, such as higher tumour grade at diagnosis, high tumour proliferation, increased expression of HER-2 (ERB-B2) and reduced expression of both estrogen (ER) and progesterone receptor (PR) [7]. These women often struggle with life issues that are either absent or much less severe in older women, such as the possibility of early menopause and effects on fertility. While clinicopathologic differences point to underlying biological differences between breast tumours found in younger versus older women, few studies have documented age-related changes at the molecular level. Cancer progression is impelled by the accumulation of somatic genetic mutations, which consist of single nucleotide substitutions, translocations and somatic mutations [8]. Somatic mutations are non-heritable alterations to the human genome that occur spontaneously in somatic cells, often due to DNA replication errors or exposure to chemicals or ultraviolet (UV) radiation. Copy number alterations (CNA) are somatic changes in the copy numbers of a DNA sequence that arise during the process of cancer development. They consist of changed chromosome structure in the form of gain or loss in copies of DNA segments, and are prevalent in many types of cancer [9].
Investigating these genomic alterations in breast cancer patients can not only offer valuable insights into breast cancer pathogenesis and discover potential biomarkers, but also provide novel drug targets for better therapeutic treatment options [10]. Several cytogenetic and array-based studies have detected recurrent alterations linked with certain cancer types, and have found CNAs to be a particularly common genetic mutation in cancer [11, 12]. In addition, some of these CNAs have resulted in the discovery of disease causal genes and novel therapeutic targets, and have been strongly associated with clinical phenotypes [13-16]. For example, the use of vemurafenib to inhibit BRAF V600E mutation has shown remarkably improved survival in melanoma patients [17]. In another study, treatment with tyrosine kinase inhibitors for EGFR in lung cancer has also shown great success [18]. Since CNAs often encompass genes, it is suspected that they may greatly influence gene expression within the CNA regions. Indeed, several studies have reported a correlation between CNA and the average global expression levels of genes located within the copy number variable chromosomal regions. For instance, one group has shown that in tumour formation from an immortalized prostate epithelial cell line, 51% of genes with increased expression were mapped to DNA gain regions and 42% of genes with decreased expression were mapped to DNA loss regions [19]. This was further supported by another group working with breast tumour cell lines, noting that DNA copy number influences gene expression across a range of CNAs, with 62% of amplified genes resulting in moderately or highly elevated expression of the genes within the amplified regions [20]. Therefore, investigation of CNAs offers the potential to gain insight into the underlying genetic composition of breast tumours in young women. 
Mining genome-wide profiles will help find breast cancer genes and pathways with strong potential for prognostic significance as a function of age. Given that approximately 40–50% of young breast cancer patients relapse after 5 years [21], these age-specific signatures could also serve as a treatment decision tool to identify young patients that would gain more benefit from particular adjuvant therapies.

RESULTS

Clinical characteristics

The young patients with breast cancer in the Discovery and Validation sets retrieved from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) [22] have very similar distributions of age, menopausal status, tumour grade, tumour size, and ER, PR and HER2 expression (p > 0.05) (Table 1). In contrast, the two sets differ significantly in tumour stage and in PAM50 subtype (p < 0.05); in particular, stage 0 is far more prevalent among young patients in the Discovery set than in the Validation set (43.1% vs 0.8%). Nevertheless, the basal subtype is the most frequent among young patients in both the Discovery and Validation sets. Note that 50 patients in the Validation set lack stage information, which may affect the comparison of stage distributions between the two sets. Since our focus is only on those Discovery set CNA candidates that are validated in the Validation set, the stage difference is unlikely to be driving CNA selection. Furthermore, it is our intention to investigate whether tumours in young women share commonalities in genetic alterations, regardless of stage and subtype.
Table 1

Clinical characteristics table comparing the METABRIC discovery dataset and validation dataset for young patients only

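| Characteristic | Discovery Young | Validation Young | P-value† |
|---|---|---|---|
| Age* | 40 (36, 43) | 40 (37, 43) | 1 |
| Menopausal Status | | | 0.5 |
| Pre | 127 (97.7%**) | 125 (100%) | |
| Post | 2 (1.5%) | 0 (0.0%) | |
| Subtype | | | <0.001 |
| Normal | 11 (8.5%) | 25 (20%) | |
| LumA | 41 (31.5%) | 18 (14.4%) | |
| LumB | 20 (15.4%) | 9 (7.2%) | |
| Her2 | 16 (12.3%) | 21 (16.8%) | |
| Basal | 42 (32.3%) | 52 (41.6%) | |
| Grade | | | 0.98 |
| 1 | 7 (5.4%) | 6 (4.8%) | |
| 2 | 37 (28.5%) | 34 (27.2%) | |
| 3 | 86 (66.1%) | 81 (64.8%) | |
| Stage | | | <0.001 |
| 0 | 56 (43.1%) | 1 (0.8%) | |
| 1 | 25 (19.2%) | 28 (22.4%) | |
| 2 | 42 (32.3%) | 35 (28.0%) | |
| 3 | 7 (5.4%) | 11 (8.8%) | |
| 4 | 0 (0.0%) | 0 (0.0%) | |
| ER-expr | | | 0.09 |
| + | 74 (56.9%) | 57 (45.6%) | |
| - | 56 (43.1%) | 68 (54.4%) | |
| PR-expr | | | 0.78 |
| + | 55 (42.3%) | 56 (44.8%) | |
| - | 75 (57.7%) | 69 (55.2%) | |
| Her2-expr | | | 0.27 |
| + | 22 (16.9%) | 29 (23.2%) | |
| - | 108 (83.1%) | 96 (76.8%) | |
| Tumour Size* (mm) | 22 (16, 30) | (17, 30) | 1 |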

*For continuous variables (Age, Tumour size), quantiles (50th percentile (25th percentile, 75th percentile)) were presented.

†P-values were determined by Wilcoxon rank sum test for continuous variables and Fisher’s exact test for categorical variables.

**The proportion was obtained by dividing the total number of patients.

Identification of recurrent CNA regions

Figure 1 shows the analysis flowchart to identify age-related recurrent CNA regions using our maximal clique-based recurrent CNA detection algorithm. In the METABRIC Discovery cohort 867 of the total 997 patients are classified into the old age group (≥45 years old) and 130 patients into the young age group. In the Validation cohort 870 of the total 995 patients are classified into the old age group and 125 patients into the young age group. After applying filtering criteria (retaining CNA data that was generated by ≥10 probes and having a CNA size of at least 1 kb), for the old age cohort in the Discovery set, there are 96,503 and 47,943 individual patient level CNA gain and loss regions respectively. For the young age cohort, there are 14,957 and 6,373 individual patient level CNA gain and loss regions, respectively.
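The idea of searching for maximal cliques in an interval graph can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `recurrent_regions`, the tuple layout of the calls, and the reading of the "inner" region as the intersection and the "outer" region as the union of the overlapping patient-level calls are all assumptions made for the example. Each patient-level CNA call is treated as an interval on one chromosome; a sweep over the sorted endpoints reports a maximal clique whenever an interval ends immediately after another has started.

```python
from typing import List, NamedTuple, Tuple, FrozenSet


class Region(NamedTuple):
    inner_start: int   # assumed: start of the segment shared by all calls
    inner_end: int     # assumed: end of the segment shared by all calls
    outer_start: int   # assumed: leftmost start among the calls
    outer_end: int     # assumed: rightmost end among the calls
    patients: FrozenSet[str]


def recurrent_regions(calls: List[Tuple[str, int, int]],
                      min_patients: int = 5) -> List[Region]:
    """Find maximal sets of mutually overlapping calls on one chromosome.

    calls: (patient_id, start, end) for a single CNA type (gain or loss).
    """
    events = []
    for idx, (_, start, end) in enumerate(calls):
        events.append((start, 0, idx))  # kind 0 (start) sorts before kind 1
        events.append((end, 1, idx))    # so touching intervals count as overlapping
    events.sort()

    active = set()
    last_was_start = False
    regions = []
    for _, kind, idx in events:
        if kind == 0:
            active.add(idx)
            last_was_start = True
            continue
        if last_was_start:
            # The active set cannot be extended any further: maximal clique.
            members = [calls[i] for i in active]
            patients = frozenset(m[0] for m in members)
            if len(patients) >= min_patients:
                regions.append(Region(
                    inner_start=max(m[1] for m in members),
                    inner_end=min(m[2] for m in members),
                    outer_start=min(m[1] for m in members),
                    outer_end=max(m[2] for m in members),
                    patients=patients,
                ))
        active.discard(idx)
        last_was_start = False
    return regions


# Hypothetical calls; min_patients is lowered to 3 to keep the example small.
example_calls = [
    ("p1", 100, 500), ("p2", 200, 600), ("p3", 300, 700),
    ("p4", 650, 900), ("p5", 680, 950),
]
example_regions = recurrent_regions(example_calls, min_patients=3)
```

With the filtering described above, the calls passed to such a function would first be restricted to those supported by at least 10 probes and spanning at least 1 kb.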
Figure 1

Analysis flowchart for identifying recurrent CNA regions

Recurrent CNA regions are identified from young and old patient cohorts in the Discovery set of METABRIC. The identified recurrent CNA regions are then validated in the Validation set of METABRIC.

After filtering the output of the recurrent CNA calling algorithm for recurrent CNA regions of at least 1 kilobase (kb) in size that are found in at least 5 patients, there are a total of 1,086 recurrent CNA gain regions (554 of which encompass protein encoding genes) and 439 recurrent CNA loss regions (202 of which encompass protein encoding genes). These regions are uniquely found in the young age group and form the young-specific recurrent gain and loss regions in the Discovery set. Validation testing is then performed using the Validation set, which contains 995 patients. All filtering criteria and algorithm implementations follow the same procedure as the Discovery dataset analysis. For recurrent CNA gain regions, a total of 81 of the 1,086 regions have been validated (found in both the Discovery and Validation datasets), of which 30 regions encompass 29 unique protein encoding genes (Table 2). For recurrent CNA loss regions, a total of 25 of the 439 regions have been validated, of which 8 regions encompass 10 unique protein encoding genes (Table 3). In total, 38 validated recurrent CNA regions with 39 protein encoding genes were identified, along with 51 validated recurrent gain CNA regions (Supplementary Table 1) and 17 validated recurrent loss CNA regions (Supplementary Table 2) that did not encompass any protein encoding genes.
Table 2

Validated recurrent gain CNA regions with genes

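| Chr | Inner Start | Inner End | Inner Size | Outer Start | Outer End | Outer Size | Gene Symbol | Size 1 | Size 2 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 84551640 | 84565561 | 13921 | 84481190 | 84729446 | 248256 | SAMD13 | 5 | 6 |
| 1 | 143607802 | 143609034 | 1232 | 143607067 | 143609055 | 1988 | PDE4DIP | 20 | 30 |
| 1 | 191374290 | 191385577 | 11287 | 191359529 | 191405183 | 45654 | CDC73 | 40 | 50 |
| 1 | 191385797 | 191402004 | 16207 | 191359529 | 191405183 | 45654 | CDC73 | 40 | 50 |
| 1 | 222004315 | 222004925 | 610 | 221990859 | 222005526 | 14667 | CAPN2 | 48 | 47 |
| 3 | 176423680 | 176427601 | 3921 | 176415397 | 176428705 | 13308 | NAALADL2 | 18 | 22 |
| 3 | 176428607 | 176428705 | 98 | 176415397 | 176470832 | 55435 | NAALADL2 | 18 | 22 |
| 5 | 22246497 | 22346803 | 100306 | 22194503 | 22414425 | 219922 | CDH12 | 12 | 11 |
| 6 | 34625387 | 34634997 | 9610 | 34624907 | 34656516 | 31609 | SPDEF | 9 | 9 |
| 7 | 134782461 | 134787038 | 4577 | 134782461 | 134792291 | 9830 | CNOT4 | 11 | 13 |
| 7 | 142150844 | 142154230 | 3386 | 142150819 | 142154515 | 3696 | PRSS1 | 7 | 10 |
| 8 | 40695071 | 40697114 | 2043 | 40693570 | 40699795 | 6225 | ZMAT4 | 17 | 16 |
| 9 | 93166194 | 93261927 | 95733 | 93157333 | 93373420 | 216087 | NFIL3 | 5 | 7 |
| 10 | 5737990 | 5742226 | 4236 | 5736767 | 5744158 | 7391 | ASB13 | 24 | 32 |
| 10 | 14598341 | 14600566 | 2225 | 14477106 | 14629604 | 152498 | FAM107B | 19 | 27 |
| 11 | 4931741 | 4932834 | 1093 | 4931741 | 4932966 | 1225 | MMP26; OR51A2 | 7 | 11 |
| 12 | 180797 | 191614 | 10817 | 180797 | 195197 | 14400 | SLC6A12 | 14 | 24 |
| 12 | 7895693 | 7897774 | 2081 | 7895693 | 7897774 | 2081 | SLC2A14 | 13 | 23 |
| 12 | 7899067 | 7905082 | 6015 | 7899067 | 7909593 | 10526 | SLC2A14 | 13 | 23 |
| 13 | 112354883 | 112363586 | 8703 | 112333434 | 112363586 | 30152 | C13orf35 | 10 | 7 |
| 13 | 113356126 | 113365589 | 9463 | 113345036 | 113371998 | 26962 | ATP4B | 10 | 9 |
| 17 | 43751830 | 43753351 | 1521 | 43722185 | 43756717 | 34532 | SKAP1 | 9 | 14 |
| 17 | 45136676 | 45139395 | 2719 | 45134175 | 45142242 | 8067 | SLC35B1 | 21 | 18 |
| 18 | 9549925 | 9575313 | 25388 | 9417006 | 9594232 | 177226 | PPP4R1 | 5 | 7 |
| 18 | 43704001 | 43707399 | 3398 | 43703934 | 43707399 | 3465 | SMAD2 | 6 | 6 |
| 19 | 40629439 | 40647918 | 18479 | 40620640 | 40702499 | 81859 | FFAR2 | 13 | 13 |
| 19 | 60890917 | 60901410 | 10493 | 60890917 | 60904859 | 13942 | EPN1 | 8 | 11 |
| 20 | 14741416 | 14743670 | 2254 | 14741416 | 14743754 | 2338 | MACROD2 | 8 | 10 |
| 20 | 41215727 | 41219453 | 3726 | 41202818 | 41220578 | 17760 | PTPRT | 12 | 18 |
| 22 | 20146692 | 20170596 | 23904 | 20145867 | 20170766 | 24899 | PI4KAP2; TMEM191C | 7 | 15 |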

The first seven columns represent the chromosome number, inner and outer start and end coordinates of the recurrent CNA region, and the size of the region in base pairs (hg18). The last three columns are the genes encompassed in each CNA region, followed by the sample size (no. of cases) in both Discovery and Validation dataset (Chr: Chromosome; Size 1: Discovery Cluster Size; Size 2: Validation Cluster Size).

Table 3

Validated recurrent loss CNA regions with genes

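| Chr | Inner Start | Inner End | Inner Size | Outer Start | Outer End | Outer Size | Gene Symbol | Size 1 | Size 2 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 97507180 | 97517476 | 10296 | 97507180 | 97520698 | 13518 | ANKRD36B | 5 | 5 |
| 3 | 62243538 | 62257523 | 13985 | 62242606 | 62277516 | 34910 | PTPRG | 6 | 6 |
| 4 | 59521 | 61566 | 2045 | 59521 | 64435 | 4914 | ZNF718; ZNF595 | 13 | 6 |
| 7 | 38296343 | 38297866 | 1523 | 38295506 | 38297939 | 2433 | TRGV11 | 18 | 22 |
| 8 | 14388851 | 14391732 | 2881 | 14385622 | 14391732 | 6110 | SGCZ | 23 | 18 |
| 9 | 5027454 | 5029342 | 1888 | 5027454 | 5030334 | 2880 | JAK2 | 7 | 11 |
| 10 | 89710114 | 89713882 | 3768 | 89708179 | 89713882 | 5703 | PTENP1; PTEN | 10 | 10 |
| 17 | 21245986 | 21253816 | 7830 | 21227031 | 21271210 | 44179 | KCNJ12 | 15 | 9 |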

The first seven columns represent the chromosome number, inner and outer start and end coordinates of the recurrent CNA region, and the size of the region in base pairs (hg18). The last three columns are the genes encompassed in each CNA region, followed by the sample size in both Discovery and Validation dataset (Chr: Chromosome; Size 1: Discovery Cluster Size; Size 2: Validation Cluster Size).

Figure 2 shows an overview of how similar the cluster sizes (i.e. number of patients) are in the Discovery set versus the Validation set for all the identified young-specific recurrent CNA regions. For both gain and loss regions, cluster sizes in the Discovery and Validation datasets have a fairly linear relationship. For example, if 30% of the young patients in the Discovery set harbour a CNA region, it is likely that around 30% of patients in the Validation set will harbour that region as well.
Figure 2

Scatter plot showing the cluster sizes of recurrent CNA regions in the discovery and validation sets

(A) Gain recurrent young-specific regions and (B) Loss recurrent young-specific regions. Each point on the plot represents a young-specific recurrent CNA region. Blue represents regions without genes, and orange represents regions with genes.
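The comparison of cluster sizes can be made concrete with a few rows taken from Tables 2 and 3. The sketch below converts the counts into fractions of the 130 Discovery and 125 Validation young patients and computes a Pearson correlation over this small subset. The choice of genes is for illustration only, and a correlation over five rows is not the result shown in Figure 2; the helper names are hypothetical.

```python
from math import sqrt

N_DISCOVERY, N_VALIDATION = 130, 125  # young patients per set (from the text)

# (Discovery cluster size, Validation cluster size), copied from Tables 2 and 3.
cluster_sizes = {
    "CAPN2": (48, 47),
    "CDC73": (40, 50),
    "ASB13": (24, 32),
    "SGCZ": (23, 18),
    "PTEN": (10, 10),
}


def to_fractions(sizes):
    """Express each cluster size as a fraction of the young patients in its set."""
    return {gene: (d / N_DISCOVERY, v / N_VALIDATION)
            for gene, (d, v) in sizes.items()}


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


fractions = to_fractions(cluster_sizes)
r = pearson([f[0] for f in fractions.values()],
            [f[1] for f in fractions.values()])
for gene, (disc, val) in fractions.items():
    print(f"{gene}: {disc:.1%} of Discovery, {val:.1%} of Validation")
print(f"Pearson r over these rows: {r:.2f}")
```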

Annotation of the identified recurrent CNA regions

We performed region-based variation annotation on the identified young-specific recurrent CNA regions (see Tables 2 and 3 and Supplementary Tables 1 and 2) with refGene using the software ANNOVAR (Annotate Variation). The complete annotation information of the recurrent CNA regions is shown in Supplementary Table 3. Figure 3 shows the genome location distribution of our recurrent CNAs with respect to the encompassed regions. The majority of the CNAs are in non-coding regions (76%) and 24% in coding regions.
Figure 3

Distribution of the identified young-specific recurrent CNA regions with respect to the genome location

The functional annotation of the young-specific recurrent CNA regions is based on software ANNOVAR.

To better visualize the mutation distribution of the 39 genes encompassed in the recurrent CNAs identified in coding regions in both the Discovery and Validation young women groups, we used the R package ComplexHeatmap (Figure 4). The heatmap shows that CNA gain regions encompassing genes CAPN2, CDC73 and ASB13 are the 3 most frequent in both the Discovery and Validation datasets (young women age group), while gene SGCZ ranks top for recurrent CNA loss regions in the two datasets.
Figure 4

Heatmap of mutation distribution for genes identified in the recurrent young-specific CNA gain and loss regions

(A) Results from the Discovery dataset and (B) Results from the Validation dataset. Rows are sorted based on the frequency of the alterations in all young-specific samples and columns are sorted to visualize the mutual exclusivity across genes. Barplots at both sides of the heatmap show numbers of different alterations for each sample and for each gene. Red represents CNA loss mutations and blue represents CNA gain mutations.

Expression quantitative trait locus analysis

An overview of the expression levels of each identified young-specific gene across all young patient samples in the Discovery (Figure 5A) and Validation (Figure 5B) datasets is provided as gene expression heatmaps. Logistic regression was then used to evaluate the statistical association between gene expression and CNA mutation status (Table 4). In total, 16 gain regions and 1 loss region show significant associations with gene expression changes. However, the direction of the association is not consistent. Fourteen of the 16 gain regions correlated with high gene expression, while the other 2 gain regions (encompassing MMP26 and SPDEF) were associated with low gene expression. For example, mutated gain CNA status in ASB13 seems to lead to higher gene expression. On the other hand, the loss region encompassing PTEN was associated with a high gene expression level.
Figure 5

Heatmap of gene expression for the genes identified in the recurrent young-specific CNA regions in young breast cancer patients

(A) Results from the Discovery dataset (B) Results from the Validation dataset. Rows represent the gene expression levels for the genes identified in the recurrent young-specific CNA gain and loss regions (same order as in Figure 4 for comparison). Columns represent the young-specific samples in Discovery and Validation datasets. The higher the intensity of the red colour, the higher the gene expression level.

Table 4

Logistic regression analysis between CNA mutation status and gene expression in combined dataset

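| Gene Symbol | Copy Number State | P-value | Odds Ratio† | Discovery Sample Size | Validation Sample Size |
|---|---|---|---|---|---|
| ASB13 | Gain | 0.049 | 1.83 (1.00–3.34) | 24 | 32 |
| ATP4B | Gain | 0.021 | 3.51 (1.21–10.17) | 10 | 9 |
| CAPN2 | Gain | 0.00002 | 3.21 (1.89–5.45) | 48 | 47 |
| CDH12 | Gain | 0.026 | 2.87 (1.14–7.24) | 12 | 11 |
| CNOT4 | Gain | 0.0004 | 7.45 (2.46–22.49) | 11 | 13 |
| EPN1 | Gain | 0.004 | 9.57 (2.15–42.56) | 8 | 11 |
| PDE4DIP | Gain | 0.002 | 2.73 (1.45–5.15) | 20 | 30 |
| PI4KAP2 | Gain | 0.031 | 2.81 (1.10–7.14) | 7 | 15 |
| PPP4R1 | Gain | 0.016 | 6.7 (1.44–31.23) | 5 | 7 |
| SLC35B1 | Gain | 0.001 | 19.19 (6.54–56.27) | 21 | 18 |
| SMAD2 | Gain | 0.022 | 6.06 (1.30–28.22) | 6 | 6 |
| SPDEF | Gain | 0.057 | 0.39 (0.15–1.03) | 9 | 9 |
| FAM107B | Gain | 0.083 | 1.79 (0.93–3.46) | 19 | 27 |
| MACROD2 | Gain | 0.085 | 2.73 (0.87–8.55) | 8 | 10 |
| MMP26 | Gain | 0.098 | 0.43 (0.15–1.17) | 7 | 11 |
| NFIL3 | Gain | 0.068 | 3.46 (0.91–13.08) | 5 | 7 |
| PTEN | Loss | 0.002 | 29.12 (3.83–221.25) | 10 | 10 |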

†Odds ratio is followed by its corresponding 95% confidence interval in brackets.
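The relationship between an odds ratio, its 95% confidence interval and its p-value can be illustrated with a short calculation. The sketch below assumes that the intervals in Table 4 are Wald intervals, symmetric on the log scale; the source does not state how they were computed, so this is an assumption, and the function name `wald_from_ci` is hypothetical. Under that assumption the standard error of the log odds ratio is recovered from the interval width and turned into a two-sided p-value. Because the tabulated values are rounded, the result is only approximate; for ASB13 and CAPN2 it lands near the tabulated 0.049 and 0.00002.

```python
from math import erfc, log, sqrt

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def wald_from_ci(ratio: float, lower: float, upper: float):
    """Back-calculate (standard error, z, two-sided p) from a ratio and its 95% CI.

    Assumes a Wald interval: exp(log(ratio) +/- Z_95 * se).
    """
    se = (log(upper) - log(lower)) / (2 * Z_95)
    z = log(ratio) / se
    p = erfc(abs(z) / sqrt(2))  # two-sided tail probability of the normal distribution
    return se, z, p


# Odds ratios and intervals copied from Table 4.
_, z_asb13, p_asb13 = wald_from_ci(1.83, 1.00, 3.34)
_, z_capn2, p_capn2 = wald_from_ci(3.21, 1.89, 5.45)
print(f"ASB13: z = {z_asb13:.2f}, p = {p_asb13:.3f}")
print(f"CAPN2: z = {z_capn2:.2f}, p = {p_capn2:.1e}")
```

The same calculation applies to a hazard ratio from a Cox model, since it is also estimated on the log scale.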

Survival analysis

We further evaluate whether the expression levels of these genes are associated with disease-specific survival (DSS) (Table 5). The expression levels of eight out of the 39 young-specific genes are significantly associated with survival outcome. A higher gene expression of genes CAPN2, NFIL3 and SLC35B1 was associated with a moderately worse survival outcome.
Table 5

Cox proportional hazard analysis of disease (breast cancer) specific survival of gene expression in combined dataset (discovery and validation)

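| Genes | CNA types | P-value | Hazard ratio† | Discovery Sample Size | Validation Sample Size | Chr | Inner Start | Inner End |
|---|---|---|---|---|---|---|---|---|
| ASB13 | Gain | 0.0001 | 0.54 (0.39–0.73) | 24 | 32 | 10 | 5737990 | 5742226 |
| CAPN2 | Gain | 0.091 | 1.54 (0.93–2.54) | 48 | 47 | 1 | 222004315 | 222004925 |
| NFIL3 | Gain | 0.009 | 1.58 (1.13–2.23) | 5 | 7 | 9 | 93166194 | 93261927 |
| PDE4DIP | Gain | 0.027 | 0.37 (0.16–0.89) | 20 | 30 | 1 | 143607802 | 143609034 |
| PTPRT | Gain | 0.0002 | 0.68 (0.55–0.83) | 12 | 18 | 20 | 41215727 | 41219453 |
| SKAP1 | Gain | 0.0002 | 0.66 (0.53–0.82) | 9 | 14 | 17 | 43751830 | 43753351 |
| SLC35B1 | Gain | 0.00005 | 1.88 (1.39–2.55) | 21 | 18 | 17 | 45136676 | 45139395 |
| JAK2 | Loss | 0.007 | 0.43 (0.24–0.79) | 7 | 11 | 9 | 5027454 | 5029342 |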

†Hazard ratio is followed by its corresponding 95% confidence interval in brackets.

Of particular interest, the mutation status of two genes, ASB13 (Figure 6A) and SGCZ (Figure 7D), was also significant in the Kaplan-Meier survival analysis, which allows estimation of a survival curve over time. Patients with a mutated status in either of these genes had a worse survival outcome than patients without the gene mutation. Other genes found to be significant in the survival analysis include ATP4B (Figure 6B), FFAR2 (Figure 6C) and PTPRT (Figure 6D), all encompassed within CNA gain regions, as well as PTENP1 (Figure 7A), PTEN (Figure 7B), ZNF718 (Figure 7C) and ZNF595 (Figure 7E), all encompassed within CNA loss regions.
Figure 6

Kaplan-Meier survival analysis for genes with significant CNA gain mutations in the young women group

Genes showing statistical significance (p < 0.05) are (A) ASB13, (B) ATP4B, (C) FFAR2 and (D) PTPRT. Survival curve in red represents patients without CNA mutation in the corresponding gene (CN = 2) while the curve in blue represents patients with CNA gain mutations in the corresponding gene (CN > 2). Y-axis is the cumulative survival probability and X-axis is the survival time in years.

Figure 7

Kaplan-Meier survival analysis for genes with significant CNA loss mutations in the young women group

Genes showing statistical significance (p < 0.05) are (A) PTENP1, (B) PTEN, (C) ZNF718, (D) SGCZ and (E) ZNF595. Survival curve in red represents patients without CNA mutation in the corresponding gene (CN = 2) while the curve in blue represents patients with CNA loss mutations in the corresponding gene (CN < 2). Y-axis is the cumulative survival probability and X-axis is the survival time in years.
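How a Kaplan-Meier curve and a median survival time are obtained from follow-up data can be sketched in a few lines. The example below is a minimal product-limit estimator written for illustration; the follow-up times and event indicators are invented and do not come from METABRIC, and the function names are hypothetical. At each distinct death time the running survival probability is multiplied by the fraction of at-risk patients who survive that time; censored patients leave the risk set without changing the estimate.

```python
from typing import List, Optional, Tuple


def kaplan_meier(times: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct death time.

    events[i] is 1 if patient i died of the disease at times[i], 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def median_survival(curve: List[Tuple[float, float]]) -> Optional[float]:
    """First time at which the estimated survival drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # the curve never reaches 50% within the follow-up


# Hypothetical follow-up in years for eight patients (1 = death, 0 = censored).
follow_up = [2, 3, 4, 4, 6, 8, 10, 15]
died = [1, 1, 1, 0, 1, 0, 1, 0]
km_curve = kaplan_meier(follow_up, died)
km_median = median_survival(km_curve)
```

Comparing two such curves, for example patients with and without a given CNA, would additionally require a test such as the log-rank test, which is not shown here.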

PTEN (Phosphatase and tensin homolog)

Results from our study show that the median survival time (i.e. the time at which half of the patients are expected to still be alive) for young patients with a copy number loss in the PTEN gene region is ~4 years as opposed to ~15 years for those without. PTEN (cytoband 10q23.31) has been identified as a tumour suppressor which inhibits the PI3K/Akt/mTOR signalling pathways [23]. It has been shown to be one of the most frequently mutated genes in all cancer types, including that of breast, ovary, prostate, glioblastoma and lymphoma. Previous studies have observed that 40% of invasive breast cancers have a loss of PTEN heterozygosity, and that the loss of one gene copy is sufficient to disrupt cell signalling and cell growth control. It has also been suggested that carriers of the PTEN mutation are at higher risk of developing breast cancers at a younger age [24].

SGCZ (sarcoglycan zeta)

Our study shows that ~16% of all young patients present a CNA loss mutation encompassing SGCZ, with a significantly shorter median survival time for young patients with this mutation of ~6 years in contrast to ~15 years for those without. SGCZ (8p22) encodes a protein that is part of the sarcoglycan complex, which plays a role in connecting the inner cytoskeleton to the extracellular matrix, possibly maintaining membrane stability [25]. Although the exact function of SGCZ in cancer is not well understood, loss of the chr8p region has been associated with several factors involved in cancer development and progression, such as the tumour having an aggressive histology, increased cell proliferation, and large size as well as the patients having increased early recurrence rate and mortality, and overall poor survival in young women. This region also contains the gene DLC1 (deleted in liver cancer 1), which has been suggested to act as a tumour suppressor [26]. DLC1 encodes a GAP protein that inhibits the activation of Rho-GTPases, which are often associated with a loss of cell adhesion. DLC1 expression has been reported to be frequently lost in tumour cells, leading to a constitutive activation of the Rho-GTPases.

CAPN2 (calpain 2)

CAPN2 (cytoband 1q41) was the most frequent CNA gain mutation in our study, with ~37-38% of all young patients harbouring a CAPN2 gain mutation. Calpains are calcium-activated intracellular proteases that have the ability to cleave cytoskeletal proteins, possibly playing a role in regulating cell invasion and migration [27]. A knockdown study of CAPN2 in breast tumour cells resulted in reduced cell migration, proliferation, as well as reduced Akt activation, increased FoxO nuclear localization and p27 expression [27]. It was suggested that CAPN2 promotes cell proliferation through the Akt-FoxO-p27 signalling pathway.

NAALADL2 (N-acetyl-L-aspartyl-L-glutamate peptidase-like 2)

Our study shows that ~16% of all young patients present a CNA gain mutation encompassing NAALADL2. NAALADL2 is a member of the NAALADase protein family which act as matrix metalloproteases and have the ability to alter the tumour environment. Microarray studies have shown that NAALADL2 is often overexpressed in prostate and colon cancers and stimulates a migratory and metastatic phenotype. A proposed mechanism is that since NAALADL2 has been found to be basal-localized, it may enhance interaction of tumour cells with the extracellular matrix surrounding the tumour and provide a mechanism for the tumour cells to escape [28]. Subsequent survival analysis shows that patients with NAALADL2 overexpression have a 45% chance of surviving up to 5 years as opposed to 93% for patients with low NAALADL2 expression. It remained prognostic for recurrence rate even after correction for clinical variables such as tumour stage and grade. Expression array analyses also associated its overexpression to changes in the epithelial-to-mesenchymal transition (EMT) and cell adhesion pathways.

Pathway enrichment analysis

A pathway enrichment analysis of the ANNOVAR gene list (174 genes) against the REACTOME database via Enrichr reveals a significant overrepresentation of phospholipid signaling (MTMR14, PTEN, PIP4K2A) and adherens junction (CDH12, CDH18, CDH7) pathways (p < 0.05) among the genes in the identified young-specific recurrent CNA regions. Both enriched pathways are highly relevant to cancer development and progression.
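Over-representation analyses of this kind typically rest on a hypergeometric (one-sided Fisher's exact) test: given a gene list, how surprising is it to see at least the observed number of pathway members? The sketch below shows that calculation. The list size (174) and the overlap (3) come from the text, while the pathway size of 30 and the background of 20,000 genes are hypothetical values chosen only to make the example run; the actual pathway sizes and background used by the enrichment tool are not given in the source.

```python
from math import comb


def hypergeom_tail(overlap: int, list_size: int,
                   pathway_size: int, background: int) -> float:
    """P(X >= overlap) for X = number of pathway genes in a random gene list.

    The list is drawn without replacement from `background` genes,
    `pathway_size` of which belong to the pathway.
    """
    upper = min(list_size, pathway_size)
    favourable = sum(
        comb(pathway_size, k) * comb(background - pathway_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(background, list_size)


# 174 and 3 are taken from the text; 30 and 20_000 are hypothetical.
p_enrich = hypergeom_tail(overlap=3, list_size=174,
                          pathway_size=30, background=20_000)
print(f"P(at least 3 pathway genes by chance) = {p_enrich:.4f}")
```

In practice the resulting p-values are also adjusted for the number of pathways tested, a step omitted from this sketch.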

Phospholipid signaling

Aside from playing an important role in structural components, lipids also have a role in signalling processes [29, 30]. These lipid molecules aggregate to form lipid rafts as highly specific platforms for cell signalling, carrying signals from activated growth factor receptors to the intracellular machinery [31]. These receptors recruit signalling effectors that induce cell proliferation and reduce cell death, dysregulation of which contributes to cancer development and progression. The phosphatidylinositol (3,4,5)-trisphosphate molecule, also known as PIP3, is generated by PI3K and leads to activation of downstream signaling components. A well-known consequence is recruitment and activation of protein kinase Akt, which can phosphorylate a variety of substrates, which in turn activate cell growth, apoptosis and cell cycle processes. PIP3 is a substrate for phosphatase and tensin homologue (PTEN), which is required for dephosphorylation of PIP3 into PIP2, essential for inhibition of the AKT pathway. Dysregulation of these pathways is frequent in many cancer types.

Cell adhesion

Cellular adhesion plays a major role in maintaining the integrity of normal cell-cell connections, and disruption in this pathway has been strongly associated with metastasis in cancers. Adherens junctions, which are sites of intracellular signalling and anchoring, provide strong bonds between adjacent cell membranes. The molecular processes governing cell-cell adhesion are very finely controlled, since they inhibit epithelial-mesenchymal transition (EMT) that is normally present during embryogenesis and tissue repair. Characteristics of EMT include a loss in intercellular adhesion and enhancement of cell migration, leading to a more motile phenotype [32]. Notably, the adherens junctions are lost during the process of EMT, which increases the risk of cancer progression such as metastasis. In normal tissues, epithelial cells are tightly bound to one another. However, in advanced cancer, many epithelial tumour cells show loss of cell-cell adhesion and increased tissue invasion. Tumours featuring local spreading and invasion are suggested to have a more aggressive phenotype and be associated with a higher mortality rate of the patient. This phenomenon has been widely seen in various cancer types, including breast, colon, prostate, ovarian and other types of cancer [33].

CONCLUSIONS

Applying the graph-based algorithm to the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) breast cancer dataset, we have identified and validated 81 recurrent CNA gain regions and 25 recurrent CNA loss regions specific to young women’s breast cancers. We have also located the corresponding candidate protein encoding genes that are encompassed in these regions. The graph-based algorithm guarantees that the identified CNA regions are the most frequent and that the minimal regions have been delineated. Identification of molecular alterations associated with disease outcome may improve risk assessment and treatments for aggressive breast cancer, especially for young women. It can give new insights into the role of CNAs in cancer predisposition, development and progression as well as contribute to a more accurate and complete human cancer genome sequence reference. We hope that the results of this study will, in the future, facilitate the development of screening methods for breast cancer biomarker discovery, especially in young women, as more prospective samples become available. Since CNAs are fairly large, it would also be interesting to further characterize the non-coding CNA regions we have identified and their role in regulating gene expression levels, either in cis or in trans.

MATERIALS AND METHODS

Data source

All breast cancer data were retrieved from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) [22], a dataset with comprehensive clinical features such as breast cancer-specific survival data, PAM50 subtypes, ER/PR/HER2 status, tumour grade and tumour size. Each case has corresponding genome-wide gene expression profiles (Illumina HT-12 v3 platform), SNPs and somatic DNA copy-number profiling data (Affymetrix Human SNP 6.0 platform). Treatments are homogeneous within each clinically relevant group: almost all ER-positive/LN-negative patients did not receive chemotherapy, while ER-negative/LN-positive patients did receive chemotherapy. Furthermore, the METABRIC patients were treated before Herceptin/trastuzumab entered standard clinical care. Therefore, the outcome of HER2-positive patients reflects the poor prognosis of such patients before the introduction of this targeted therapy [22]. All samples are derived from ~2,000 clinically annotated primary fresh-frozen breast cancer specimens from tumour banks in the UK and Canada (METABRIC divided these into a discovery set of 997 primary tumours and a validation set of 995 tumours). All genomic and clinically annotated data are available at the European Genome-Phenome Archive (http://www.ebi.ac.uk/ega/), under accession number EGAS00000000083 [22]. The individual CNA calls for the ~2,000 samples pre-exist from the METABRIC study and were downloaded from EGAS00000000083 [22]. The circular binary segmentation (CBS) method was used to make the individual CNA calls. CBS is a segmentation-based method that scans for change points in an ordered sequence of copy number values to delineate segments with different distributions of the values (measured by having different means). In other words, it recursively divides the chromosome until it has identified segments whose probe-value distribution differs from that of their neighbours [41].
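As a rough illustration of the scan at the heart of CBS, the sketch below searches an ordered sequence of copy-number values for the contiguous arc whose mean differs most from the mean of the remaining values. It is a simplified single pass under stated assumptions: the published method [41] also tests whether each candidate split is significant and then recurses on the resulting segments, and both steps are omitted here. The function name and the log-ratio values are hypothetical, and the statistic is left unscaled by the overall standard deviation because scaling does not change which arc wins.

```python
import math

def best_circular_split(values):
    """Return (i, j, z) for the arc values[i:j] whose mean differs most
    from the mean of all other values (hypothetical helper, not the
    METABRIC pipeline)."""
    n = len(values)
    # Prefix sums let us get any arc sum in O(1).
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
    total = prefix[-1]

    best = (0, n, 0.0)
    for i in range(n):
        for j in range(i + 1, n + 1):
            k = j - i                      # arc length
            if k == n:                     # the whole sequence has no complement
                continue
            arc_sum = prefix[j] - prefix[i]
            diff = arc_sum / k - (total - arc_sum) / (n - k)
            # Two-sample mean-difference statistic (unit variance assumed).
            z = abs(diff) / math.sqrt(1.0 / k + 1.0 / (n - k))
            if z > best[2]:
                best = (i, j, z)
    return best

# Hypothetical log-ratio values with a gained segment at positions 4..7.
log_ratios = [0.0, 0.1, -0.1, 0.0, 1.0, 1.1, 0.9, 1.0, 0.0, 0.1]
start, end, score = best_circular_split(log_ratios)
```

Here `start` and `end` delimit the half-open index range of the candidate segment; a full implementation would decide by permutation or a tail approximation whether `score` justifies cutting the chromosome at those points.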

Representing CNAs as an interval graph

Figure 8A shows examples of five individual patient level CNA segments (CNA 1, CNA 2, CNA 3, CNA 4, CNA 5) on the same chromosome. Each of the five CNAs has chromosome-specific start (left) and end (right) positions. To identify the common regions of individual patient level CNAs on the same chromosome, the intersections among the individual patient level CNAs can be represented as an interval graph, treating each called individual patient level CNA as a vertex of the graph and connecting two vertices only if the corresponding intervals have an intersecting region. Thus, the constructed interval graph G(V, E) comprises a set of vertices V, where each vertex (v ∈ V) corresponds to a specific interval of the individual patient level CNA and each edge ({u, v} ∈ E) connects two intersecting intervals u and v. In Figure 8B, an example of the interval graph is shown where CNA 1 through CNA 4 are the intervals (nodes of the graph or individual patient level CNAs) and an edge connects two nodes (individual patient level CNAs) if the intervals overlap.
Figure 8

Representing CNAs as an interval graph

(A) CNA 1, CNA 2, CNA 3, CNA 4 and CNA 5 are individual patient level CNAs on a specific chromosome. Each CNA has chromosomal start and end positions. (B) The interval graph in which CNA 1, CNA 2, CNA 3, CNA 4 and CNA 5 are the individual patient level CNAs from (A). An edge between two vertices indicates that the two individual patient level CNAs share a common region on the chromosome.
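The construction of the interval graph can be sketched in a few lines. This is a minimal illustration, assuming closed intervals given as (start, end) pairs in base pairs; the coordinates are hypothetical and are not those drawn in Figure 8, and `build_interval_graph` is an illustrative name rather than a function from the authors' software.

```python
from itertools import combinations

def build_interval_graph(cnas):
    """Build an adjacency map from a dict of name -> (start, end).

    Two CNAs are connected when their closed intervals intersect.
    """
    graph = {name: set() for name in cnas}
    for (u, (u_start, u_end)), (v, (v_start, v_end)) in combinations(cnas.items(), 2):
        # Closed intervals overlap when each one starts before the other ends.
        if u_start <= v_end and v_start <= u_end:
            graph[u].add(v)
            graph[v].add(u)
    return graph

# Hypothetical CNA calls on one chromosome (base-pair coordinates).
example_cnas = {
    "CNA 1": (100, 400),
    "CNA 2": (300, 700),
    "CNA 3": (350, 900),
    "CNA 4": (800, 1200),
    "CNA 5": (1500, 1800),
}
graph = build_interval_graph(example_cnas)
```

With these made-up coordinates, CNA 1, CNA 2 and CNA 3 are mutually connected, CNA 4 is connected only to CNA 3, and CNA 5 is an isolated vertex.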

To find maximal cliques in an interval graph constructed from individual patient level CNAs, we applied Gentleman and Vandal’s algorithm [34]. The main idea of the algorithm is to sort the vertices based on their chromosomal end positions. The ordering is important because it allows the algorithm to discard vertices in each iteration without losing the triangulation property. The input of the algorithm is the set of individual patient level CNAs on a specific chromosome, with two parameters for each CNA segment: the start and end positions (in base pairs). Each identified maximal clique is a recurrent CNA, that is, a CNA common to multiple patients. The region of the recurrent CNA shared across those patients is the minimal common region (MCR) of the CNA, which has the potential to harbour cancer causing/associated genes. In practice, a maximal clique should have a size of at least 2 and an MCR should be at least 1 kb long. It should be noted that CNA gains and losses need to be analysed separately. More details of the algorithm and its applications can be found in [35].

Disease (breast cancer) specific survival analysis was performed for the mutation status (CNA gain, CNA loss) using the product-limit (Kaplan-Meier) method, and for the expression level of the corresponding genes encompassed in the validated recurrent CNA regions using the Cox proportional hazards model [36]. All analyses were performed using the survival R package (https://cran.r-project.org/web/packages/survival/index.html).
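Returning to the maximal-clique step, the sketch below shows one standard way to enumerate the maximal cliques of an interval graph together with their MCRs: sweep over the sorted interval endpoints and report the set of currently open intervals whenever an end point directly follows a start point. It is an illustrative endpoint sweep, not a reproduction of the implementation in [34] or [35]. The function name, the patient labels and the coordinates are hypothetical, and the MCR length is taken as end minus start.

```python
def maximal_cliques_with_mcr(intervals, min_clique_size=2, min_mcr_bp=1000):
    """Enumerate maximal cliques of an interval graph and their MCRs.

    intervals: dict of name -> (start, end), closed intervals on one
    chromosome, all of the same type (gains only or losses only).
    Returns a list of (sorted member names, (mcr_start, mcr_end)).
    """
    events = []
    for name, (start, end) in intervals.items():
        events.append((start, 0, name))  # 0 = start; sorts before an end at the same position
        events.append((end, 1, name))    # 1 = end
    events.sort()

    active = set()          # intervals currently open in the sweep
    last_was_start = False
    results = []
    for _pos, kind, name in events:
        if kind == 0:
            active.add(name)
            last_was_start = True
        else:
            if last_was_start:
                # An end right after a start: the open set is a maximal clique.
                mcr_start = max(intervals[n][0] for n in active)
                mcr_end = min(intervals[n][1] for n in active)
                if len(active) >= min_clique_size and mcr_end - mcr_start >= min_mcr_bp:
                    results.append((sorted(active), (mcr_start, mcr_end)))
            active.discard(name)
            last_was_start = False
    return results

# Hypothetical CNA gain calls from five patients on one chromosome.
example = {
    "P1": (10_000, 40_000),
    "P2": (30_000, 70_000),
    "P3": (35_000, 90_000),
    "P4": (80_000, 120_000),
    "P5": (150_000, 180_000),
}
cliques = maximal_cliques_with_mcr(example)
```

For this toy input the sweep reports two recurrent regions, one shared by P1, P2 and P3 and one shared by P3 and P4; P5 overlaps nobody and is dropped by the clique-size threshold. Gains and losses would be passed through the function in separate calls.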

eQTL analysis

An expression quantitative trait locus (eQTL) is a locus that explains a portion of the genetic variance of a gene expression phenotype. An eQTL analysis tests for direct associations between markers of genetic variation and gene expression levels; here, it evaluates the association between gene expression and CNA mutation status. Logistic regression is used to estimate the probability p associated with a dichotomous response for various values of an explanatory variable. In this case, the response (dependent) variable is gene expression (binarized at its mean) and the predictor (independent) variable is CNA status.
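The following sketch illustrates this kind of test for a single gene. It assumes, for simplicity, that CNA status is coded as a binary indicator (altered versus not altered); with one binary predictor the maximum-likelihood logistic regression fit has a closed form, in which the slope equals the log odds ratio of the 2x2 table. The function names and the data are hypothetical, and the original analysis may have coded CNA status differently.

```python
import math

def binarize_by_mean(values):
    """Code expression as 1 (above the mean) or 0 (at or below the mean)."""
    mean = sum(values) / len(values)
    return [1 if v > mean else 0 for v in values]

def logistic_fit_binary_predictor(y, x):
    """Closed-form logistic regression of binary y on binary x.

    Returns (intercept, slope, two-sided Wald p-value for the slope).
    """
    n11 = sum(1 for yi, xi in zip(y, x) if yi == 1 and xi == 1)
    n01 = sum(1 for yi, xi in zip(y, x) if yi == 0 and xi == 1)
    n10 = sum(1 for yi, xi in zip(y, x) if yi == 1 and xi == 0)
    n00 = sum(1 for yi, xi in zip(y, x) if yi == 0 and xi == 0)
    if min(n11, n01, n10, n00) == 0:
        raise ValueError("empty cell: the maximum-likelihood estimate is not finite")
    intercept = math.log(n10 / n00)              # log odds of high expression when x = 0
    slope = math.log((n11 * n00) / (n01 * n10))  # log odds ratio
    std_err = math.sqrt(1 / n11 + 1 / n01 + 1 / n10 + 1 / n00)
    z = slope / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail
    return intercept, slope, p_value

# Hypothetical expression values and CNA status for eight patients.
expression = [2.0, 2.5, 3.0, 8.0, 7.5, 9.0, 2.2, 8.5]
cna_status = [0, 0, 0, 1, 1, 1, 1, 0]
high_expr = binarize_by_mean(expression)
intercept, slope, p_value = logistic_fit_binary_predictor(high_expr, cna_status)
```

A positive slope indicates that patients carrying the CNA are more likely to show above-mean expression of the gene; with covariates or a multi-level CNA coding, an iterative fit from a statistics package would be needed instead of the closed form.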

Functional analysis

Functional analyses, namely enrichment and annotation, were carried out using the Enrichr and ANNOVAR software to determine whether the identified CNA regions with protein coding genes are enriched in any interesting pathways or functions. Enrichr [37] contains a diverse and up-to-date collection of over 100 gene-set libraries available for analysis and download. It was used to perform pathway enrichment analysis on the identified young-specific genes to identify which pathways are over-represented in the gene set. ANNOVAR [38] is a Perl command-line program for genome annotation. Its region-based annotation was used to identify affected genomic regions that lie outside of the protein-coding regions.
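To make the notion of over-representation concrete, the sketch below computes a one-sided hypergeometric (Fisher exact) p-value for the overlap between a query gene list and a pathway gene set. This is a generic illustration of the statistical idea and is not claimed to reproduce Enrichr's scores; the function name and all counts in the example are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_pathway, n_query, n_overlap):
    """P(X >= n_overlap) when n_query genes are drawn without replacement
    from n_background genes, of which n_pathway belong to the pathway."""
    upper = min(n_pathway, n_query)
    total = comb(n_background, n_query)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_query - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 20,000 background genes, a 100-gene pathway,
# a 39-gene query list, and 3 query genes falling in the pathway.
p_enrich = hypergeom_enrichment_p(20_000, 100, 39, 3)
```

A small p-value suggests that the pathway contains more query genes than expected by chance; when many pathways are tested, the p-values would additionally need a multiple-testing correction.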

Biological visualization

To aid visualization and interpretation of the results, the software programs Oncoprint [39] and CIMminer [40] were used to generate heatmap visualizations for the identified candidate regions. Oncoprint is included in the R package ComplexHeatmap and visualizes multiple genomic alteration events in the format of a heatmap. It was used to visualize the frequencies of CNA mutation for each of the young-specific regions with genes in the Discovery and Validation datasets. CIMminer generates color-coded Clustered Image Maps (CIMs) to portray “high-dimensional” data sets such as gene expression profiles. It was used to visualize the relative expression levels, in terms of colour intensity, for each of the identified young-specific genes.
  39 in total

1.  Epithelial-mesenchymal transition, the tumor microenvironment, and metastatic behavior of epithelial malignancies.

Authors:  Lindsay J Talbot; Syamal D Bhattacharya; Paul C Kuo
Journal:  Int J Biochem Mol Biol       Date:  2012-05-18

2.  Atypical PKCiota contributes to poor prognosis through loss of apical-basal polarity and cyclin E overexpression in ovarian cancer.

Authors:  Astrid M Eder; Xiaomei Sui; Daniel G Rosen; Laura K Nolden; Kwai Wa Cheng; John P Lahad; Madhuri Kango-Singh; Karen H Lu; Carla L Warneke; Edward N Atkinson; Isabelle Bedrosian; Khandan Keyomarsi; Wen-lin Kuo; Joe W Gray; Jerry C P Yin; Jinsong Liu; Georg Halder; Gordon B Mills
Journal:  Proc Natl Acad Sci U S A       Date:  2005-08-22       Impact factor: 11.205

3.  Elucidating prognosis and biology of breast cancer arising in young women using gene expression profiling.

Authors:  Hatem A Azim; Stefan Michiels; Philippe L Bedard; Sandeep K Singhal; Carmen Criscitiello; Michail Ignatiadis; Benjamin Haibe-Kains; Martine J Piccart; Christos Sotiriou; Sherene Loi
Journal:  Clin Cancer Res       Date:  2012-01-18       Impact factor: 12.531

4.  An information-intensive approach to the molecular pharmacology of cancer.

Authors:  J N Weinstein; T G Myers; P M O'Connor; S H Friend; A J Fornace; K W Kohn; T Fojo; S E Bates; L V Rubinstein; N L Anderson; J K Buolamwini; W W van Osdol; A P Monks; D A Scudiero; E A Sausville; D W Zaharevitz; B Bunow; V N Viswanadhan; G S Johnson; R E Wittes; K D Paull
Journal:  Science       Date:  1997-01-17       Impact factor: 47.728

5.  Molecular portraits of human breast tumours.

Authors:  C M Perou; T Sørlie; M B Eisen; M van de Rijn; S S Jeffrey; C A Rees; J R Pollack; D T Ross; H Johnsen; L A Akslen; O Fluge; A Pergamenschikov; C Williams; S X Zhu; P E Lønning; A L Børresen-Dale; P O Brown; D Botstein
Journal:  Nature       Date:  2000-08-17       Impact factor: 49.962

Review 6.  Linking somatic genetic alterations in cancer to therapeutics.

Authors:  Darrin Stuart; William R Sellers
Journal:  Curr Opin Cell Biol       Date:  2009-03-26       Impact factor: 8.382

7.  Supervised risk predictor of breast cancer based on intrinsic subtypes.

Authors:  Joel S Parker; Michael Mullins; Maggie C U Cheang; Samuel Leung; David Voduc; Tammi Vickery; Sherri Davies; Christiane Fauron; Xiaping He; Zhiyuan Hu; John F Quackenbush; Inge J Stijleman; Juan Palazzo; J S Marron; Andrew B Nobel; Elaine Mardis; Torsten O Nielsen; Matthew J Ellis; Charles M Perou; Philip S Bernard
Journal:  J Clin Oncol       Date:  2009-02-09       Impact factor: 44.544

8.  Patient and tumor characteristics associated with increased mortality in young women (< or =40 years) with breast cancer.

Authors:  Ankit Bharat; Rebecca L Aft; Feng Gao; Julie A Margenthaler
Journal:  J Surg Oncol       Date:  2009-09-01       Impact factor: 3.454

9.  Restoration of DLC-1 gene expression induces apoptosis and inhibits both cell growth and tumorigenicity in human hepatocellular carcinoma cells.

Authors:  Xiaoling Zhou; Snorri S Thorgeirsson; Nicholas C Popescu
Journal:  Oncogene       Date:  2004-02-12       Impact factor: 9.867

10.  A Novel Graph-based Algorithm to Infer Recurrent Copy Number Variations in Cancer.

Authors:  Chen Chi; Rasif Ajwad; Qin Kuang; Pingzhao Hu
Journal:  Cancer Inform       Date:  2016-10-09