Literature DB >> 23884293

Novel insights into breast cancer genetic variance through RNA sequencing.

Anelia Horvath¹, Suresh Babu Pakala, Prakriti Mudvari, Sirigiri Divijendra Natha Reddy, Kazufumi Ohshiro, Sandra Casimiro, Ricardo Pires, Suzanne A W Fuqua, Masakazu Toi, Luis Costa, Sujit S Nair, Saraswati Sukumar, Rakesh Kumar.

Abstract

Using RNA sequencing of triple-negative breast cancer (TNBC), non-TBNC and HER2-positive breast cancer sub-types, here we report novel expressed variants, allelic prevalence and abundance, and coexpression with other variation, and splicing signatures. To reveal the most prevalent variant alleles, we overlaid our findings with cancer- and population-based datasets and validated a subset of novel variants of cancer-related genes: ESRP2, GBP1, TPP1, MAD2L1BP, GLUD2 and SLC30A8. As a proof-of-principle, we demonstrated that a rare substitution in the splicing coordinator ESRP2 (R353Q) impairs its ability to bind to its substrate FGFR2 pre-mRNA. In addition, we describe novel SNPs and INDELs in cancer relevant genes with no prior reported association of point mutations with cancer, such as MTAP and MAGED1. For the first time, this study illustrates the power of RNA-sequencing in revealing the variation landscape of breast transcriptome and exemplifies analytical strategies to search regulatory interactions among cancer relevant molecules.

Entities: Chemical

Mesh：

Substances：

Year: 2013 PMID： 23884293 PMCID： PMC3722564 DOI： 10.1038/srep02256

Source DB: PubMed Journal: Sci Rep ISSN： 2045-2322 Impact factor: 4.996

Breast cancer is the third most frequent cancer in the world as it affects approximately one in ten women in the western world1. The initial knowledge that connected breast cancer to genetic susceptibility originated from the clinical observations that highlighted the clustering of breast cancer cases in families23. Approximately 5–10% of breast cancers are believed to result from the inheritance of rare genetic components that confer significantly elevated risk45. For example, mutations in the tumor suppressor genes BRCA1 and BRCA2 account for approximately 16% of the familial breast cancer678. The vast majority of breast cancer cases, however, are derived from a complex interaction between multiple environmental, lifestyle and genetic factors with relatively weak individual risk contribution910. While the effects of many environmental and lifestyle factors, such as diet, reproductive behavior and radiation are well appreciated, the knowledge on genetically contributing patterns is limited. Association studies have identified ATM, BRIP1, CASP8, CDH1, CHEK2, PALB2, PTEN, STK11, and TP53 as breast cancer susceptibility genes. Such mutations collectively account for 2.3% of familial risk of breast cancer, and together with BRCA1, BRCA2 and others have been implicated in high risk screening strategies5811121314151617181920. Nonetheless, significant proportion of the familial and non-familial breast cancer susceptibility remains unknown, suggesting plethora of genetic elements that need to be understood. Transcriptome sequencing comprises a unique interplay between individual genetic background, reflected in the variation content, and the epigenetic and environmental regulation affecting gene expression levels and splice patterns. Recent transcriptome sequencing efforts have highlighted important somatic events in metastatic triple negative breast cancer (TNBC) and described important for the clinical outcome genotype-phenotype correlations21. Further, transcriptome sequencing data have been successfully explored to reveal disrupted pathways in TNBC through genome-wide loss of heterozygosity and mono-allelic expression estimation22. As a result of these and other studies, the feasibility of transcriptome sequencing to uncover molecular mechanisms of breast cancer drivers is increasingly appreciated23. Here we used whole transcriptome RNA-sequencing to reveal the variation signatures of 17 breast cancer patient tissues, and compared with human normal breast organoids (referred from here on as normal breast tissue, NBT). The 17 samples include six TNBC, lacking expression of therapeutically significant components - estrogen receptor (ER), progesterone-receptor (PR) and the Human Epidermal Growth Factor Receptor 2 (HER2); six Non-TNBC (ER, PR and HER2-positive); and five HER2-positive samples (ER and PR negative). Compared to the extensively performed searches for somatic breast cancer mutations, our RNA-sequencing based approach detects SNPs that are expressed at the mRNA level, and allows estimation of their allelic expression at nucleotide resolution. A set of novel variants were validated through Sanger sequencing. As a proof-of-principle, we have explored the effect of a rare SNP- p.R353Q - in the epithelial splicing regulatory protein ESRP2, on the binding and splicing of its target pre-mRNA. Our study reports a set of novel mutations in essential regulatory molecules in breast cancer and discusses their allelic preferential expression and potential involvement in breast cancer.

Results

Analytical strategies and overall variation landscape

We set out to define the transcribed variation profile of TNBC, Non-TNBC and HER2-positive breast cancer samples. To achieve this, we applied mRNA sequencing on 17 breast cancer samples from unrelated individuals as well as on three NBT samples on the Illumina HiSeq 2000 platform. The raw reads were aligned against Ensembl GRCh37.62 B (hg19) using TopHat24, and the variants were called using Samtools25. Prior to filtering, a total of 1,876,617 SNPs, 331,197 of which were novel, were called across all 17 breast cancer samples, and between 30,294 and 258,465 SNPs (average 110,389) were called in each individual sample (Supplementary Table 1). The overview of the workflow and the filtering strategy is presented in Figure 1. The SNP calls were separated into two groups – either reported in the databases (between 22,914 and 218,411 per sample, average 91,201), or novel. The previously reported SNPs, due to validation by at-least one independent group, were analyzed further without filtering. To increase the confidence in the calls of novel variants, we initially analyzed the SNP calling reads of 1,000 SNPs through Integrative Genomics Viewer (IGV) files, and 96 of the calls were tested by Sanger sequencing. Based on the findings of this pilot validation test, we set up filtering criteria retaining minimum false-positive and false-negative calls as follows: those supported by at least of three bidirectional reads with unique start position, minimum phred quality value of 20, mapping quality value (MQV) > 20, and presence in 3 or less different samples. To ensure that we were not missing any novel high prevalence SNPs among our samples, all the positions at which a novel SNP was called in 4 or more samples were visually examined through IGV before to be assigned as false positives – no novel SNPs called in 4 or more samples were identified. This filtering left us with between 60 and 1143 novel variants per sample (average 285). The transition to transvertion (Ts/Tv) ratio among the novel coding SNPs was 2.8 and aligns with previously reported values for human exome, thus increasing the confidence of our filtering algorithm2627.

Figure 1

Workflow and filtering overview.

Different filtering strategies were applied to the novel and the previously reported variants. The previously reported variants, due to validation by at-least one independent group, were analyzed further without filtering. The filtering criteria for the novel variants were set as follows: those supported by a minimum of three bidirectional reads with unique start position, minimum phred quality value of 20, mapping quality value (MQV) > 20, and presence in 3 or less different samples. All the positions at which a novel SNP was called in 4 or more samples were visually examined through IGV before assignment as false positives.

Prior to filtering, between 1,574 and 11,669 previously reported INDELs were called in each of the studied breast cancer samples and subjected to further analysis (See Supplementary Table 1). The novel INDELs were quality filtered to remove calls with MQV less than 20, phred quality value below 20 and presence in three or more different samples. This left between 18 and 142 novel INDELs (average 59) per sample, which were retained in our further analysis.

Expressed SNP density

To assess the overall expressed variation landscape of the breast cancer samples, we estimated the SNP density by counting the number of SNPs per megabase (MB) genome intervals. The SNP density was calculated individually for each sample and compiled per group (TNBC, non-TNBC and HER-2), and as a whole for the 17 samples (Figure 2). Overall, the SNP density distribution across the three groups was very similar, with a few regions showing group-specific high-density loci. All the TNBC samples presented with high number of SNPs in the region of chr14:10500000–106000000, which was mainly contributed by increased overall SNP number in the large gene encoding nucleoprotein AHNAK2. Specifically enriched in all non-TNBC samples was the region on chr19:53000000, mainly due to high number of SNPs in the zinc finger protein (ZNF) encoding genes.

Figure 2

Expressed SNP density expressed as number of SNPs per megabase (MB) genome intervals.

The SNP density was calculated individually for each sample and compiled per group (TNBC, non-TNBC and HER-2), and as a whole for the 17 samples. Overall, similar SNP density distribution is observed across the three groups. (A) Circos plot representing the high density expressed SNPs in TNBC, Non-TNBC and HER2 positive Breast Cancer Samples compared to cancer genome SNP data from COSMIC. (B) The highest SNP density loci for TNBC, Non-TNBC, HER2 and COSMIC, compared to the SNP density for the same locus for the other three groups. The highest SNP density for the COSMIC was observed in the interval chr2:48000000–48999999, containing the genes MSH6, FBX011, FOXN2, PPP1R21, STON1, GTF2A1L and LHCGR, while very low expressed SNP density for this region was measured in all three breast cancer subtypes.

We also overlapped the expressed overall SNP density in our samples with somatic genome SNP density calculated from the COSMIC dataset. There was a significant overlap in the overall SNP distribution. However, the regions with highest SNP frequency differed: while in the COSMIC dataset they were chr2:48000000, chr17:20000000 and chr5:72000000, the three top SNP-enriched regions when all of our samples were analyzed together were chr6:31000000–32000000, chr8:144000000 and chr19:53000000. While the high density observed on chromosome 6 was due to a the well-known variability in the histone cluster (HIST1H1A) and major histocompatibility complex (HLA), the chromosome 8 region was enriched by variants in epidermal antigen Epiplakin1 (EPPK1) and lymphocyte antigen (LY6E) (see Figure 2).

Comparative analysis with cancer genome variations (COSMIC)

We compared the SNPs identified in our samples with the COSMIC cancer genome somatic mutation database (http://www.sanger.ac.uk/cosmic)28. A total of 2,169 SNPs from the COSMIC database were found among our samples, 129 of which were present in more than 10 of the 17 samples, and 6 were called in all breast cancer samples. Only one SNP – the relatively common variant R1322X in the ABC transporter gene ABCA10 was nonsense, 515 were missense and 20 were located within a splice site. Of note, only two of the SNPs in our dataset, both UTR located, overlapped with COSMIC variants found in breast cancer: 1) the promoter T > C substitution in the proto-oncogene binding Yes-associated protein (YAP1) was seen in 8 of our 17 patients, and 2) the 3′-UTR C > T substitution in peptidylprolyl isomerase F (PPIF) was found in 5 of our samples. Among these comparisons, highly represented in our datasets were COSMIC missense variants in the DNA-repair encoding probable helicase senataxin SETX and Ewing's tumor-associated antigen ETAA1.

GWAS associated SNPs in the breast cancer transcriptome

To outline SNPs that have been previously associated with breast cancer phenotypes, we overlaid our datasets with the publically available genome wide association studies (GWAS); the results are summarized in Supplementary Table 2. The pre-B-cell leukemia homeobox 1 (PBX1) intronic SNP rs1387389 that has been reported to strongly associate with early onset breast cancer29 was present in 4/17 samples, two of which were homozygous. Similarly, two breast cancer associated SNPs in the fibroblast growth factor receptor (FGFR2)30, rs2420946 and rs2981582, were present each in two of our samples, (one patient was a carrier of both), again, in a homozygous state. Of note, the mitogen-activated protein kinase kinase kinase (MAP3K1) SNP reported by the same study was not present among our samples, however, we found a higher prevalence of the closely positioned D860N and V906I missense substitutions in MAP3K1; they were called in 13 (9 homozygous) and 16 (11 homozygous) of our samples, respectively. Similar high homozygocity prevalence was seen for the rs704010 rs8170, rs2180341, rs13281615, rs3817198 and rs4973768, but was not observed for the intergenic rs4415084. Other SNPs reported to be in strong association with breast cancer from recent meta-analyses2930313233 were not seen in our samples.

Variations in genes previously implicated in hereditary breast cancer

To search for known predisposing breast cancer variants among our samples, we extracted SNPs and INDELS in genes that have been previously associated with hereditary breast cancer. Among all 17 samples, 80 SNP calls (38 unique SNPs) and 66 INDEL calls (38 unique INDELs) mapped within ATM, BRCA1, BRCA2, BRIP1, CASP8, CDH1, CHEK1, PTEN, STK11 or TP53. While the majority of the SNPs called in those genes variants were common or have no known effect on the protein, several variants have been previously linked to breast cancer predisposition (Table 1). In BRCA1 and BRCA2 collectively, twelve different missense substitutions were identified in a total of nine patients. Five of the missense substitutions (p.Q356R, p.R496H, and p.D693N in BRCA1, and p.N289H and p.D1420Y in BRCA2) have been previously associated with breast cancer through either family or case-control studies3435. Three patients from the non-TNBC group were carriers of the missense variant p.Q356R, and the other cancer-associated variants were present in one patient each – from either non-TNBC or HER2 positive groups. The two BRCA2 missense substitutions were seen in HER2-positive patients. In addition, one TNBC patient carried a small BRCA1 deletion (chr17:41246251delC, rs80357794) that leads to a frame-shift and premature stop codon expected to completely abolish protein function. In the ATM gene, we identified the non-synonymous substitutions p.F858L and p.L1420F, which have been previously associated with increased risk for breast cancer36. The ATM missense p.F858L is known to impact the interaction of ATM with beta-adaptin, which is necessary for clatherin mediated receptor endocytosis and is proposed to contribute to the hereditary radio sensitivity and breast cancer3738. In addition to breast cancer-associated variants, a missense substitution in STK11, p.H175Y, previously reported in a lung carcinoma39 was been found in one patient. One non-TNBC patient carried simultaneously pathogenic variants in BRCA1 and ATM. Of note, overall a higher number of variant versus reference reads was assessed across BRCA1, BRCA2 and ATM variations.

Table 1

Variants in ATM, BRCA1, BRCA2 and STK11 identified in the 17 breast cancer samples (from HGMD, http://www.hgmd.cf.ac.uk/ac/index.php. The homo- or heterozygosity and the number of the unique variant and reference reads are also shown

Gene	Chromosomal Location	cDNA	Protein	rsID	Function	Cancer Associated	ID	Cancer subtype	Zygosity	Var/ref calls
ATM	chr11:108138003	c.2572 T > C	p.F858L	1000056	missense	YES	IP2-71	Non-TNBC	heterozygote	3/2
	chr11:108160350	c.4258 C > T	p.L1420F	1000058	missense	YES	171	HER2	heterozygote	9/10
	chr11:108175462	c.5557 G > A	p.D1853N	1801516	missense	NO	IP2-42	Non-TNBC	heterozygote	6/2
							IP2-49	Non-TNBC	homozygote	4/0
							56	HER2	heterozygote	10/3
							83	HER2	heterozygote	6/7
BRCA1	chr17:41246481	c.1067 T > C	p.Q356R	1799950	missense	YES	IP2-42	Non-TNBC	homozygote	2/0
							IP2-49	Non-TNBC	heterozygote	4/7
							IP2-66	Non-TNBC	heterozygote	3/6
	chr17:41246061	c.1487 C > T	p.R496H	28897677	missense	YES	IP2-71	Non-TNBC	heterozygote	4/4
	chr17:41245471	c.2077 C > T	p.D693N	4986850	missense	YES	83	HER2	homozygote	4/1
	chr17:41246251	c.1156delG	p.A386Pfs	80357794	indel	YES	IP2-78	TNBC	homozygote	4/0
	chr17:41244936	c.2612 G > A	p.P871E	799917	missense	NO	IP2-48	Non-TNBC	heterozygote	3/1
							IP2-49	Non-TNBC	homozygote	6/0
							IP2-66	Non-TNBC	heterozygote	6/3
							26	HER2	homozygote	20/0
							83	HER2	heterozygote	4/1
							171	HER2	homozygote	5/0
							IP2-78	TNBC	heterozygote	3/2
	chr17:41244435	c.3113 T > C	p.E1038G	1799966	missense	NO	IP2-48	Non-TNBC	homozygote	2/0
	chr17:41244000	c.3548 T > C	p.K1183R	16942	missense	NO	IP2-48	Non-TNBC	homozygote	3/0
							83	HER2	heterozygote	4/1
							171	HER2	homozygote	5/0
	chr17:41223094	c.4900 T > C	p.S1634G	1799966	missense	NO	IP2-48	Non-TNBC	homozygote	3/0
							IP2-66	Non-TNBC	homozygote	4/0
							171	HER2	homozygote	4/0
BRCA2	chr13:32906480	c.865 A > C	p.N289H	766173	missense	YES	56	HER2	heterozygote	3/2
	chr13:32912750	c.4258 G > T	p.D1420Y	766173	missense	YES	26	HER2	heterozygote	6/5
	chr13:32911463	c.2971 A > G	p.N991D	1799944	missense	NO	56	HER2	heterozygote	4/4
	chr13:32929387	c.7397 T > C	p.V2466A	169547	missense	NO	IP2-50	TNBC	homozygote	2/0
							26	HER2	homozygote	3/0
							56	HER2	homozygote	3/0
							83	HER2	homozygote	2/0
							171	HER2	homozygote	5/0
	chr13:32930673	c.7544 C > T	p.T2515I	28897744	missense	NO	IP2-66	Non-TNBC	homozygote	2/0
STK11	chr19:1220427	c.520 C > T	p.H174Y	0	missense	YES/lung	IP2-42	Non-TNBC	heterozygote	8/18

Prevalence of rare variants

To reveal variants that might be overrepresented in our samples compared to the general population, we compared the allele frequency of coding SNPs called in our samples against 11,666 alleles from the Exome Sequencing Project dataset (http://evs.gs.washington.edu/EVS/). To minimize error due to different variant calling platforms and to increase the statistical significance of the findings, we excluded from this analysis SNPs called in less than 10 alleles from the ESP dataset and in less than 3 individuals among our 17 samples. SNPs called in all 17 of our samples were also excluded. For the purposes of allele frequencies comparison, we assigned two alleles for every homozygote call in our dataset; and Yates corrected chi-square was calculated for each distribution. The top 50 most prevalent missense SNPs among the 17 samples are presented in Table 2. The highest difference in the allele distribution between our dataset and ESP was estimated for rs2305376 in the gene HOOK2, encoding a component of the FTS/Hook/FHIP (FHF) complex that has a role in vesicle trafficking and maintenance of centrosome function and is known to interact with the JUN proto-oncogene4041. Interestingly, the variant is predicted to be damaging change due to glycine to arginine substitution (p.G10R), which in addition to its low prevalence in the ESP datasets, is rare to absent in the European population datasets (see Table 2). This variant was called in 3 of our 17 samples, and all of them were called homozygous due to the high abundance of variant over wild type reads. Another overrepresented SNP in our dataset was the missense substitution p.T573A in the protein tyrosine phosphatase PTPN12, whose activity is lost in a large proportion of breast cancer cases42. Of note, while PTPN12 loss is most strongly associated with the TNBC phenotype43, we found this variant equally prevalent in all three breast cancer subtypes; one TNBC and one HER2-positive samples carried it in homozygote state. Other breast cancer implicated genes with prevalent variants amongst our samples were PLEC, PRCP, DSG2 and ERBB2IP, all harboring predicted to be damaging aminoacid changes44454647. Potential contribution of such variants to the phenotype in these patients is worth investigation.

Table 2

Comparison between the observed prevalence of known SNPs in the 17 studied breast cancer samples and Exome Sequencing Project – presented are the top 50 variants showing higher prevalence among the 17 samples. The variants are sorted according descending chi-square value. High Polyphen score indicates high probability of the variant to alter the protein function

	17 Breast cancer samples				ESP				Chi-Square Test				HapMap
Gene	#Samples	#Var Alleles	#Ref allelels	Var/Ref alleles	#Var Alleles	#Ref allelels	Var/Ref alleles	FREQ 17/FREQ ESP	value	P	Yates value	Yates P	Afr	Eur	Asia	rsID	AA change	Polyphen Score
HOOK2	3	6	28	0.21	8	12136	0.001	325.1	912	0	766	0	0	0	7.3	2305376	G10R	0.99
ANTXR1	3	3	31	0.10	21	12903	0.002	59.5	138	0	95	0	NA	NA	NA	28365986	R7K	0.009
MCOLN1	3	4	30	0.13	46	12960	0.004	37.6	115	0	88	0	NA	NA	NA	73003348	T261M	0.614
C3orf17	3	4	30	0.13	65	12925	0.005	26.5	82	0	62	0	NA	NA	NA	115971253	V297I	0.117
TNKS1BP1	5	6	28	0.21	182	12808	0.014	15.1	63	0	52	0	NA	NA	NA	34448143	A100P	0.194
PLEC	4	6	28	0.21	185	12391	0.015	14.4	59	0	49	0	NA	NA	NA	3135103	R569Q	0.993
DDX18	3	4	30	0.13	93	12913	0.007	18.5	56	0	42	0	NA	NA	NA	61755349	V371I	0.011
AP1M2	3	3	31	0.10	52	12238	0.004	22.8	54	0	37	0	NA	NA	NA	34276903	Y85C	0.998
GEMIN4	3	6	28	0.21	297	12201	0.024	8.8	33	0	27	0	NA	NA	NA	191778127	H873Q	0
LY75	5	8	26	0.31	658	12348	0.053	5.8	24	0	20	0	NA	NA	NA	35941588	T1393I	0.435
TXNDC5	4	5	29	0.17	289	10583	0.027	6.3	18.8	0.00001	14.4	0.0001	NA	NA	NA	183777097	P49S	0
HIST1H1B	6	6	28	0.21	512	12494	0.041	5.2	16.7	0.00004	13.3	0.0003	0	7.5	0	34144478	A211T	0.404
IMPACT	8	11	23	0.48	1602	11404	0.140	3.4	12.5	0.0004	10.8	0.001	4.3	16.9	2.2	582234	D125E	0.001
AKAP9	7	9	25	0.36	1229	11777	0.104	3.4	11.4	0.0007	9.5	0.002	1.7	15	0.6	35759833	K2476R	0.001
QRSL1	3	5	29	0.17	456	12550	0.036	4.7	12.5	0.0004	9.4	0.002	0	5	5.6	34221917	N263S	0.593
DSG2	6	7	27	0.26	759	11175	0.068	3.8	11.5	0.0007	9.2	0.002	0	9.9	0.2	2230234	I293V	0.99
HLA-DRB5	6	10	24	0.42	1288	10016	0.129	3.2	10.9	0.001	9.2	0.002	NA	NA	NA	112872773	V209L	0
JRK	4	5	29	0.17	459	12219	0.038	4.6	11.9	0.0006	8.9	0.002	NA	NA	NA	34288113	T30M	0.988
LAPTM5	11	14	20	0.70	2541	10465	0.243	2.9	10.1	0.001	8.8	0.003	2.5	25.8	7.2	35351292	R226K	0.002
DDRGK1	8	11	23	0.48	1762	11244	0.157	3.1	10.2	0.001	8.7	0.003	0	19.2	0	11591	A303T	0.004
KIF20B	7	12	22	0.55	2044	10952	0.187	2.9	9.8	0.001	8.4	0.004	2.5	26.9	37.9	1129777	A50G	0.918
GLB1	4	6	28	0.21	637	11695	0.054	3.9	10.7	0.001	8.3	0.004	NA	NA	NA	73826339	S401G	0.001
WDR55	5	7	27	0.26	873	12133	0.072	3.6	10.4	0.001	8.3	0.004	NA	NA	NA	35983033	Y235C	0.998
THUMPD3	8	10	24	0.42	1564	11442	0.137	3.0	9.7	0.001	8.1	0.004	4.3	18.9	7.8	1129174	R459Q	0.056
NLRP2	8	14	20	0.70	2629	10377	0.253	2.8	9.2	0.002	8	0.005	NA	NA	NA	34804158	T529A	0
XPO5	4	6	28	0.21	645	11467	0.056	3.8	10.1	0.001	7.9	0.005	0	14.2	0	34324334	S241N	0.002
RRS1	3	4	30	0.13	374	12560	0.030	4.5	9.4	0.002	6.6	0.01	NA	4.1	20.4	3739336	R191L	0.999
PLSCR1	6	6	28	0.21	770	12236	0.063	3.4	8.3	0.004	6.4	0.01	0	8.3	3.1	343320	H262Y	0.945
PTPN12	10	12	22	0.55	2252	10754	0.209	2.6	7.6	0.005	6.4	0.01	19.7	11.3	29.8	3750050	T573A	0.001
REV3L	4	6	28	0.21	789	12215	0.065	3.3	7.9	0.004	6	0.01	3.8	4.1	0	458017	Y1156C	0.002
ERBB2IP	7	9	25	0.36	1498	11506	0.130	2.8	7.4	0.006	6	0.01	0	17.8	8.1	3213837	S274L	0.024
SHARPIN	4	4	30	0.13	401	12255	0.033	4.1	8.1	0.004	5.6	0.02	NA	NA	NA	34674752	P294S	0.447
KIAA1755	10	19	15	1.27	4605	8401	0.548	2.3	6.2	0.01	5.4	0.02	10.8	47.5	38.8	3746471	R1045W	0.003
SEC63	3	4	30	0.13	427	12579	0.034	3.9	7.6	0.006	5.2	0.02	NA	NA	NA	17854547	V556I	0.203
LAMC2	4	6	28	0.21	848	12158	0.070	3.1	6.9	0.009	5.2	0.02	0	9.9	0.2	11586699	T124M	0.999
ZNF880	4	7	27	0.26	376	4190	0.090	2.9	6.7	0.009	5.2	0.02	NA	NA	NA	14048	V12M	0.999
IGFBP7	7	8	26	0.31	640	5676	0.113	2.7	6.6	0.01	5.2	0.02	NA	NA	NA	11573021	L11F	0.005
PHF3	3	3	31	0.10	255	12751	0.020	4.8	8.2	0.04	5.1	0.02	NA	NA	NA	34288820	V525I	0.972
KIAA0232	3	4	30	0.13	395	11489	0.034	3.9	7.5	0.006	5.1	0.02	NA	NA	NA	116439703	P1138S	0.22
PJA1	7	11	23	0.48	1742	8821	0.197	2.4	6.2	0.01	5.1	0.02	12.5	24.6	12.5	11539157	E606D	0.992
PRCP	3	3	31	0.10	265	12741	0.021	4.7	7.8	0.005	4.8	0.03	NA	NA	NA	2228312	T465S	0.021
MXRA5	14	23	11	2.09	5006	5557	0.901	2.3	5.6	0.02	4.8	0.03	22	27.6	34.9	1974522	P1665S	0.689
PARP14	7	7	27	0.26	1022	10826	0.094	2.7	6.1	0.01	4.7	0.03	4	7.6	10.9	13093808	A561E	0.972
EML4	12	15	19	0.79	3414	9592	0.356	2.2	6.6	0.02	4.7	0.03	11.7	26.7	43.9	28651764	K398R	0.003
ACOX1	13	24	10	2.40	6616	6390	1.035	2.3	5.3	0.02	4.5	0.03	13.6	34.9	29	1135640	I312M	0.007
SP110	3	6	28	0.21	928	12078	0.077	2.8	5.6	0.01	4.16	0.04	0	9	4	11556887	A128V	0.999
PPL	12	20	14	1.43	5261	7733	0.680	2.1	4.7	0.03	4	0.05	0	47.5	30.4	2037912	Q1573E	0.994
TBL2	3	4	30	0.13	489	12517	0.039	3.4	5.9	0.01	3.9	0.05	NA	3.1	NA	35607697	V345I	0.969
ZBTB45	3	4	30	0.13	491	12511	0.039	3.4	5.9	0.01	3.9	0.05	NA	NA	NA	35430780	D293E	0
TMEM106C	4	6	28	0.21	962	12044	0.080	2.7	5.1	0.02	3.8	0.05	0	11.2	40.4	2286025	S175F	0.985

Deleterious protein mutations

A selected subgroup of potentially deleterious SNPs consisted of mutations predicted to generate premature stop (PMS) codons through either nonsense substitution or a splice-site aberration leading to a frame-shift due to out-of-frame exon skip or intron retention. In this group we also retained SNPs removing a stop codon, because of their known severe biological consequences. A total of 1,593 variants of this type were called across our 17 samples, from which 77 were different nonsense SNPs and 16 were INDELs leading to a stop codon generation (Table 3). Among the stop-codon mutations, p.R93X in Steroid Receptor activator SRA1 was present in 10 of the 17 samples, and in 8 of them it was called homozygous. Given that SRA1 is involved in regulating the activity of steroid receptors and is deregulated in breast cancer48, the high expression rate of a nonsense variant might indicate functional implications in our breast cancer samples. In addition, two HER2 and two TNBC samples were positive for the p.Q281X in the zinc-activated ligand-gated ion channel (ZACN). Interestingly, the nucleotide change causing this mutation - chr17:74077797 C > T - also resides in the 5′UTR of the gene encoding exocyst complex component (EXOC7), whose deregulated expression was reported to be a strong predictor for metastatic outcome in early stage TNBC49. The two TNBC samples positive for the variant did not show presence of reference reads in this position. When all the genes affected by a deleterious mutation were analyzed through Ingenuity Pathway Analysis (IPA), the top affected molecular networks were cell death and survival, cellular development, and cellular growth and proliferation, and the top affected canonical pathway was estrogen receptor signaling (Supplementary Figure 1).

Table 3

Deleterious mutations identified among the 17 Breast Cancer Samples

Gene	Sample ID	# Samples	Function	Location	REF allele	Var allel	Change	Zygosity
ABCA10	56	1	stop-gained	chr17:67149973	G	A	p.R1322X	homozygote
C17orf77	IP2-76	2	stop-gained	chr17:72588806	C	A	p.C207X	homozygote
C17orf77	IP2-71	2	stop-gained	chr17:72588806	C	A	p.C207X	homozygote
C5orf20	56	1	stop-gained	chr5:134782450	T	A	p.R117X	homozygote
CARM1	IP2-50	1	stop-gained	chr19:11019790	C	A	p.Y155X	heterozygote
CCDC25	IP2-90	1	stop-gained	chr8:27610077	C	A	p.E66X	homozygote
DCAF11	IP2-50	1	stop-gained	chr14:24592413	C	A	p.Y486X	heterozygote
EFHA1	IP2-69	1	stop-gained	chr13:22077082	T	A	p.K306X	heterozygote
EGFL6	171	1	stop-gained	chrX:13624542	C	T	p.R189X	homozygote
ERV3-1	IP2-50	1	stop-gained	chr7:64452738	G	A	p.R223X	homozygote
EXOC7,ZACN	56	4	stop-gained	chr17:74077797	C	T	p.Q281X	heterozygote
EXOC7,ZACN	IP2-76	4	stop-gained	chr17:74077797	C	T	p.Q281X	homozygote
EXOC7,ZACN	IP2-78	4	stop-gained	chr17:74077797	C	T	p.Q281X	homozygote
EXOC7,ZACN	171	4	stop-gained	chr17:74077797	C	T	p.Q281X	heterozygote
FCGR2A	IP2-65	1	stop-gained	chr1:161476204	C	T	p.Q62X	homozygote
GAB4	IP2-90	1	stop-gained	chr22:17469049	C	A	p.G163X	homozygote
GET4	IP2-83	1	stop-gained	chr7:931966	C	A	p.Y219X	homozygote
HNRNPR	IP2-76	1	stop-gained	chr1:23637469	G	T	p.Y460X	heterozygote
IL17RB	IP2-83	1	stop-gained	chr3:53899276	C	T	p.Q484X	heterozygote
LAIR2	IP2-76	1	stop-gained	chr19:55019261	C	T	p.R76X	heterozygote
LOC1009964	26	1	stop-gained	chr6:57398270	C	T	p.Q325X	heterozygote
MAD2L1BP	IP2-49	1	stop-gained	chr6:43608124	C	T	p.R227X	heterozygote
MADD	IP2-83	1	stop-gained	chr11:47306630	C	T	p.R766X	homozygote
MAGEB16	IP2-66	1	stop-gained	chrX:35821127	C	T	p.R272X	homozygote
METAP1	IP2-50	1	stop-gained	chr4:99982427	C	T	p.R374X	heterozygote
MTA2	IP2-83	1	stop-gained	chr11:62364262	G	T	p.Y243X	homozygote
NHLRC2	IP2-76	1	stop-gained	chr10:115618327	C	A	p.Y73X	heterozygote
PDE4DIP	171	1	stop-gained	chr1:144915561	G	A	p.R622X	heterozygote
PDE4DIP	IP2-65	1	stop-gained	chr1:145075683	C	T	p.W60X	homozygote
PDE4DIP	IP2-49	2	stop-gained	chr1:144916676	C	T	p.W560X	heterozygote
PDE4DIP	26	2	stop-gained	chr1:144916676	C	T	p.W560X	heterozygote
PELI3	IP2-50	1	stop-gained	chr11:66235714	G	T	p.E39X	homozygote
PKD1L2	IP2-65	1	stop-gained	chr16:81242198	G	A	p.Q220X	homozygote
PRB4	IP2-49	1	stop-gained	chr12:11461802	G	A	p.R39X	homozygote
PRM3	IP2-76	1	stop-gained	chr16:11367143	G	A	p.R104X	homozygote
RHBDD3	IP2-49	1	stop-gained	chr22:29656431	C	T	p.W289X	homozygote
SKIV2L	IP2-42	1	stop-gained	chr6:31936654	C	T	p.R1063X	homozygote
SYNE2	26	1	stop-gained	chr14:64560092	G	A	p.W4001X	homozygote
TMEM134	IP2-42	1	stop-gained	chr11:67235051	G	A	p.R84X	heterozygote
VPS13B	171	1	stop-gained	chr8:100133706	T	G	p.Y413X	homozygote
ZSWIM3	171	1	stop-gained	chr20:44505411	G	T	p.E72X	homozygote
ANKS1A	IP2-69	1	INDEL	chr6:34738008	A	AA		homozygote
ANKS1A	IP2-42	1	INDEL	chr6:34902473	G	GT		heterozygote
CABIN1	171	1	INDEL	chr22:24455826	G	GAAAA		homozygote
CABIN1	83	1	INDEL	chr22:24448944	T	TT		homozygote
CANX	IP2-69	2	INDEL	chr5:179140762	A	AA		homozygote
CMYA5	56	1	INDEL	chr5:78982956	GCTT	GCTTCTT		homozygote
EME1	56	1	INDEL	chr17:48276005	C	CC		homozygote
LAMA3	IP2-49	1	INDEL	chr18:21434967	ATAAA	A		homozygote
MGST2	IP2-42	3	INDEL	chr4:140619265	T	TT		homozygote
MRPS15	56	1	INDEL	chr1:36921785	TA	TGGAAAA		homozygote
SLC17A5	IP2-78	1	INDEL	chr6:74351412	AC	ACACC		homozygote
SLC5A8	IP2-66	1	INDEL	chr12:101550975	CACA	CACACA		heterozygote
SMARCA5	171	1	INDEL	chr4:144340520	AAGAA	AA		heterozygote
TRAPPC9	IP2-83	1	INDEL	chr8:141413543	A	AA		homozygote
ZNF100	IP2-50	1	INDEL	chr19:21908799	CACA	CA		homozygote
ZNFX1	IP2-42	1	INDEL	chr20:47871283	GACCCTTGGA			homozygote

Novel expressed variations in breast cancer, and allele specific expression

The statistics on the filtered novel SNPs and INDELs are summarized in Supplementary Table 3; a complete list of the novel exonic annotated variations is available upon request. As expected, majority novel variants mapped within gene regions (70% of the SNPs and 66% INDELs). Filtering out of the intronic calls significantly reduced both SNP and INDEL numbers to between 43 and 186 SNPs per sample (average 76) and between none and 17 INDELs (average 8). Overall, 8% of the novel intergenic SNPs and 4% of the novel INDELs mapped within exons. Across the 17 samples, the total number of genes with coding and regulatory sequences affected by at least one novel SNP was 2103, and the genes with at least one novel INDEL were 566. A selected set of exonic variants were confirmed by Sanger sequencing (Figure 3).

Figure 3

Sanger Sequencing validation of selected variants; IGV is also presented.

(A) ESRP2, p.R353Q; (B) TPP1, p.P320S; (C) GBP1, p.R244H; (D) MAD2BP1, p.R227X. For TPP1, GBP1 and MAD both IGV and chromatogram show prevalence of the variant allele.

From the novel exonic SNPs, 285 unique SNPs were predicted to alter the protein sequence. Based on position and function, three of these SNPs were annotated to generate a novel stop codon, 114 were located within 2 bp of a splice junction, and 174 were missense, from which 70 were predicted to significantly affect the protein function. Six novel SNPs had dual annotation: missense substitutions located at a splice site. A total of 121 novel coding SNPs affected highly conservative nucleotide positions. Three novel coding SNPs - one missense and two synonymous substitutions - were called in two samples, and one - a stop codon in the solute carrier SLC30A8 (p.Q28X) - in three different breast cancer samples (Table 4). Interestingly, p.Q28X affects only one (NM_173851.2) of the five protein coding SLC30A8 splice isoforms; this isoform possesses an alternative 5′-end and is present in all three samples from our set that expressed this isoform. The stop codon is located early in the protein chain and likely leads to complete abolishment of the protein expression. Since this SLC30A8 isoform was not expressed in the remaining 14 breast cancer samples, this early stop codon may indicate regulatory mechanism preventing the expression of this particular isoform in breast tissue.

Table 4

Novel Exonic Variants that are seen in two and three out of the 17 breast Cancer Samples

Gene	Chromosomal Location	cDNA Annotation	Protein Annotation	Function	Samples	Cancer subtype
SLC30A8	chr8:118159203 C > T	c.82 C > T	p.Q28X	stop-codon	IP2-48	Non-TNBC
					IP2-49	Non-TNBC
					IP2-66	Non-TNBC
AGL	chr1:100387140 G > T	c.4484 G > T	p.C1495F	missense	IP2-49	Non-TNBC
					IP2-66	Non-TNBC
GLUD2	chrX:120182480 A > G	c.942 A > G	p.L314L	synonymous	IP2-50	TNBC
					83	TNBC
GPN1	chr2:27873001 A > G	c.1101 A > G	p.E367E	synonymous	IP2-49	Non-TNBC
					IP2-66	Non-TNBC

To assess potential allele preferential expression, we analyzed the ratio of reference and variant reads at all coding positions for the novel SNPs called by 6 or more reads harboring the variant nucleotide (Figure 4). Fifty seven of these novel SNPs were called by variant reads only (i.e. no reference call was present at the corresponding position), and additional 53 showed higher than 5-fold number of variant calling reads over the wild type (Supplementary Table 4). Among the most preferentially expressed novel SNPs were missense variants in previously linked breast cancer genes, such as methyl-thioadenosine phosphorylase MTAP (p.K71R), and melanoma antigen MAGED1 (p.G87A)50.

Figure 4

Allele preferentially expressed novel missense variants through estimation of the ratio of reference and variant reads.

The Variant-to wild type allele ration was estimated for all the novel SNPs called by 6 or more reads harboring the variant nucleotide. Fifty seven novel SNPs were called by variant reads only (i.e. no reference call was present at the corresponding position), and additional 53 showed higher than 5-fold number of variant calling reads over the wild type. Among the most preferentially expressed novel SNPs were novel missense variants in previously linked to breast cancer genes such as methyl-thioadenosine phosphorylase MTAP (p.K71R), and melanoma antigen MAGED1 (p.G87A).

Studies revealing impaired interaction of splicing coordinator ESRP2 bearing a R353Q substitution

Among the novel and rare SNPs predicted to be protein-altering in our breast cancer samples, we selected to study the functional effect of the R353Q substitution in ESRP2, based on the established connection of ESRP2 to cancer through its role in epithelial-to-mesenchymal transition (EMT)515253. Arginine 353 is located in the second RNA recognition motif (RRM) domain of the ESRP2, which is known to interact with specific pre-mRNAs sequences. There are three RRM domains in ESRP2, and they are implicated in regulating the expression of specific splice variants of FGFR2, CTNND1 and ENAH that are involved in EMT. We applied site-directed mutagenesis to generate ESRP2R353Q harboring expression vector and transfected MDA-MB-231 human breast cancer cells in parallel with expression constructs containing wild type ESPR2. After transfection, wild type and the mutant ESRP2R353Q proteins were purified and compared for their ability to bind the FGFR2 pre-mRNA region through Electrophoretic Mobility Shift Assay (EMSA); the results are shown in Figure 5. We observed strong interactions between the wild type ESRP2 and the FGFR2 pre-mRNA as previously reported52. However, this interaction was significantly impaired for the mutant ESRP2R353Q (compare lanes 6 and 7 with Lanes 2 and 3, Panels A–C), suggesting that the R353Q substitution compromises ESPR2 binding, and potentially, splice regulation of the FGFR2 pre-mRNA. This effect was observed in all three tested breast cancer cell lines: MCF-7 (Figure 5A), MDA-MB-231 (Figure 5B), and BT-549 (Figure 5C). Further, in line with previous observations52, RT-PCR showed increased expression of the epithelial FGFR2 isoform IIIb after transfection with wild type ESRP2 in the mesenchymal FGFR2 IIIc-expressing cell lines MDA-MB-231 and BT-549 (Figure 5D). This increase in FGFR2 IIIb expression was lower (BT-549) to completely abolished (MDA-MB-231) after the transfection with the mutant ESRP2R353Q (Figure 5D).

Figure 5

EMSA of ESRP2 interaction with FGFR2 ISE/ISS-3 cis-regulatory motif pre-RNA.

The R353Q mutation in ESRP2 compromises FGFR2-IIIb expression. Vector (PIBX-CFF-B), ESRP2wt or ESRP2R353Q were transiently transfected in breast cancer cell lines: MCF-7 (A), MDA-MB-231 (B) and BT-549 (C). RNA binding of ESRP2wt or ESRP2R353Q is shown. The incubated samples were resolved on 6% native-PAGE gel and detected by Phosphor imager. D) R353Q mutation in ESRP2 compromises FGFR2-IIIb expression. Vector (PIBX-CFF-B), ESRP2-Wt or ESRP2-Mut (R353Q) were transiently transfected in mesenchymal breast cancer cell lines, MDA-MB-231 and BT-549. FGFR-IIIb was detected by RT-PCR. The bands were quantified and normalized by the actin band intensities.

Discussion

Here we present the first mRNA sequencing based study that reports expressed variations in TNBC, Non-TNBC and HER2-positive breast cancer transcriptome. Several molecular mechanisms, such as RNA editing and allele preferential expression, could cause a discrepancy between the variations found at mRNA and DNA levels. Compared to exome and genome sequencing, RNA-seq provides essential insights into the functionality of the variants through estimation of the absolute and relative abundance of variant reads and the co-existence or mutual exclusion of variations, expression and splicing patterns. In addition to outlining the general landscape of the breast cancer variation transcriptome, our study reports novel variants in an allele-specific expression context, aligns our findings with the existing knowledge on breast cancer genetics, and exemplifies efficient extraction of information from the transcriptome through extensive analyses. Mutations in previously associated breast cancer genes, BRCA1, BRCA2 and ATM, were called in 9/17 (53%) samples which is a higher than the previously reported mutation prevalence among breast cancer patients51213141516181920. While only one patient was a carrier of known pathological variants in both BRCA1 and ATM, five other individuals carried missense substitutions in at least two different breast cancer associated genes (see Table 1). Whether the disease in these patients could be contributed to cumulative impaired functioning of these genes is a subject of further investigation; nevertheless, the relatively frequent co-occurrence of protein altering variations in known breast cancer-associated genes in different cancers raises the necessity to examine larger series of patients and controls for combinatorial genetic risk. An interesting observation is the high prevalence of homozygote vs. heterozygote calls in BRCA1, BRCA2 and ATM for both breast cancer-associated genes, and those not known to be pathogenic variants, suggesting potential allelic loss in those genes. Among the essential findings of our study is a subset of novel SNPs and INDELs, some affecting genes previously implicated in breast cancer, in which, however, no predisposing or causative point variants have been reported so far. An example is p.K71R in MTAP, frequently seen co-deleted with the CDKN2A and CDKN2B tumor suppressor genes in a large cohort of 2000 breast tumors50. While the biological significance of p.K71R in MTAP and other novel variations in cancer-associated genes is currently unclear, overexpression of novel variant over reference alleles points to a possible contribution to tumor initiation or progression. Since these variants have not been previously reported, they are not likely to be present in a homozygote state at the genomic level, and their allelic dominance may indicate expression or growth advantage, as well as potential loss of heterozygosity. Because such events may drive or contribute to cancer, a systematic investigation of allelic dominance of novel variants across larger expression sets is needed. In addition, we also identified a higher frequency (compared to non-breast cancer populations) of previously reported variants in many genes, including breast cancer-associated genes, such as PTPN12, PRCP, PLEC, DSG2 and ERBB2IP. Estimation of the prevalence of such variants in larger breast cancer cohorts is needed as it may indicate contribution to genetic risk or co-existence with causative mutations. Although this analysis holds promising potential to identify overrepresented alleles, it is important to take into account that transcriptome sequencing variant calls differs from the exome sequencing in allelic representation of homo- and heterozygote state (i.e. number of alleles). While homo vs. heterozygosity on transcriptome level provides an additional layer of information on the potential functionality of these variants, the results should be used only after confirmation by independent studies. Nevertheless, statistical confidence may be increased for SNPs in which the difference is achieved through the analysis of high number of samples rather than homozygote appearance, such as PTPN12 and DSG2 (see Table 2). Such prevalent variants in genes implicated in breast cancer are worthy of investigation in independent breast cancer datasets. Similarly to the above discussed prevalence of mutant reads in breast cancer-associated variants, GWAS associated SNPs in our set also showed high prevalence of homozygous vs. heterozygous calls. This overall prevalence of variant over reference reads for variant positions in cancer implicated genes, needs to be further investigated as potential indicator of mechanistic implications, such as loss of heterozygosity or preferential allelic expression. As the information content of the transcriptome as a common denominator combining frequency and expression data is emerging, large scale studies are expected to enlighten the feasibility and the information value of these types of analyses32. Finally, we selected a rare, predicted to be protein damaging missense substitution from our dataset – p.R353Q in the splicing coordinator ESRP2 - to demonstrate in vitro the effect of the p.R353Q substitution on the ESRP2 protein function. We were able to show that the replacement of the arginine 353 with the variant glutamine leads to a significant reduction of the binding ability of ESRP2 to FGFR2 pre-mRNA. Thus, this could potentially affect epithelial-to-mesenchymal transition programs. Overall, our analysis identified enrichment of variants known to be implicated in breast cancer as well as novel and rare variants in genes associated with breast cancer in our set of 17 breast cancer samples. Further, the within-individual exploration of the variance showed multiple disease associating variants in most of the individuals, and points to the need for estimation of cumulative action of genetic alterations. This study reports an initial collection of variants that are expressed across the breast cancer transcriptome, including novel and reported mutations in their allelic abundance and co-presence with other variants. In addition to providing an overall variation landscape of the breast cancer transcriptome, such as expressed SNP density and deleterious variants scaffold, we exemplify different analytical strategies to search for molecular interactions and regulatory networks potentially implicated in breast tumorigenesis. Compared to exome and genome studies, transcriptome exploration provides higher information content through the estimation of the expression abundance, in the immediate context of allelic prevalence and co-existence with expression and splicing features5455. It is essential to keep in mind however that the transcriptome only captures a snap shot and further functional characterization of the observed molecular features is needed to prove disease-causative relationships. Nevertheless, our study provides an important breast cancer transcriptome dataset for further explorations on either high-throughput or individual gene/protein scale.

Methods

Human patient samples

The human breast cancer tissue RNA samples were provided by Dr. Suzanne Fuqua (Baylor College of Medicine). All of the human samples were used in accordance with the IRB procedures of Baylor College of Medicine and Dana-Farber Cancer Institute and Harvard Medical School, respectively. The breast tumor types, TNBC, Non-TNBC and HER2-positive, were classified on the basis of RNA sequencing FPKM abundance and immunohistochemical and qRT-PCR classification (data not shown) as previously described5455. All breast cancer patients were from European descent.

Illumina genome sequencing RNA sequencing library preparation

Large and small ribosomal RNA (rRNA) was removed from total RNA using RiboMinus Eukaryote Kit (Invitrogen, Carlsbad, CA). Five micrograms of total RNA were hybridized to rRNA-specific biotin labeled probes at 70°C for 5 minutes. The rRNA-probe complexes were then removed by streptavidin-coated magnetic beads. The rRNA-free transcriptome RNA was concentrated by ethanol precipitation. The cDNA synthesis and DNA library construction for all the seventeen samples were performed as described5455.

Read alignment and transcript assembly

The paired end raw reads were aligned using the TopHat version 1.2.0 that allows two mismatches in the alignment. The aligned reads were assembled into transcripts using cufflinks version 2.0.0. The alignment quality and distribution of the reads were estimated using SAM tools. From the aligned reads, de novo transcript assembly was performed to capture the major splice rearrangements and novel variations that occur in the transcriptomes of TNBC, Non-TNBC and HER2- positive breast cancers in comparison to NBT using cufflinks version 1.3.036. The cuffcompare program was used to identify transcripts that are identical to the reference human genome (the Ensembl GRCh37.62 B (hg19) reference genome). Further analysis and novel isoform call was performed through the reconstructed transfrags that comprise novel splice junctions and share at least one splice junction with a reference transcript. The very low abundant transcripts were identified by binning the transcripts according to their FPKM and the transcripts with FPKM below 0.3 were eliminated from further analysis. All the analyses presented in this manuscript are performed using two categories of transcripts: transcripts that are identical to reference and transcripts that comprise novel junctions. The global statistics, which includes the distributions of FPKM scores across samples and the dendogram that shows the relationship between the samples based on the reconstructed transcripts, were analyzed using cummeRbund package of cufflinks suite of programs. The average exon number was in the reassembled transcripts is comparable to the human genome reference average. To annotate novel splice events, we used Multivariate Analysis of Transcript Splicing (MATS). Additionally, for consistency checking and independent validation we used an inhouse built program (http://ccb.jhu.edu/software/ASprofile/) to compare the exon models between isoforms assembled with the program cufflinks for the normal and cancer samples. As mentioned earlier, only the isoforms that are similar to reference and isoforms that comprise novel splice junctions were considered. We determined the splicing differences indicative of exon inclusion, exclusion, alternative 5′, 3′, and intron retention events.

Variants call and annotation

Variants calls were obtained using Mpileup utility of SAMTools (http://samtools.sourceforge.net/mpileup.shtml). Base Alignment Quality was used to score the variant call. Consensus calling is done using bcftools. Maximum depth call was set at 100000. The variants were annotated using SeattleSeq Annotation Tools version 8.01, dbSNP build 137 (http://snp.gs.washington.edu/SeattleSeqAnnotation137/).

Sanger sequencing

First-strand cDNA was synthesized with SuperScript III reverse transcriptase (Invitrogen, Inc) using 1 μg of total RNA and mixture of oligo dT primer and random hexamers. For selected variants, cDNA primers flanking the variant position were designed using Primer356 and in RT-PCR to amplify the region of interest. The products were separated on 1% agarose gel, excited and purified using QIAquick Gel Extraction Kit (Qiagen, Inc.) according to the manufacturer instructions. The purified fragments were subjected to bi-directional Sanger sequencing with the forward and the reverse primer used for the amplification.

Statistics

To test if the distribution of variant alleles differed between our group and non-breast cancer populations, we applied chi-square test (2 × 2 tables). All the values were subjected to Yates correction for contingency to prevent overestimation of significance; p values below 0.05 were considered significant.

EMSA

To determine if the R353Q substitution affects the ability of ESRP2 to bind its substrate we used wild type and mutant FLAG-tagged ESRP2 ORF introduced in PBIX as previously described52. Three cell lines MCF-7, MDA-231 and BT-549 were transfected cell lines using FuGENE® Transfection Reagents (Promega, Inc.) according to the manufacturer recommendations. Nuclear extracts were prepared using a Nonidet P-40 lysis method. RNA oligos of ISE/ISS-3 were end labeled with using the annealed [γ-32P] ATP in a 20 μl reaction mixture for 15 min at room temperature. RNA probes were incubated with respective nuclear extracts. Samples were run on a non-denaturing 6% polyacrylamide gel and imaged by autoradiography. Specific competitions were performed by adding a 100-molar excess of unlabeled probe to the incubation mixture and supershift Electrophoretic mobility shift assay (EMSA) were performed using FLAG antibody (Sigma-Aldrich).

Author Contributions

R.K. directed the project and designed the experiments. A.H. and R.K. designed the analyses, analyzed the data, and wrote the manuscript. P.M. and S.B.P. carried out dry-lab and validation studies, respectively. S.D.R. and K.O. performed EMSA and RT-PCR experiments. S.C., R.P., L.C., S.A.W.F., M.T. and S. S. provided reagents and biological insights. S.S.N. carried out the initial RNA-sequencing experiments and editing.

55 in total

1. SNP detection and genotyping from low-coverage sequencing data on multiple diploid samples.

Authors: Si Quang Le; Richard Durbin
Journal: Genome Res Date: 2010-10-27 Impact factor: 9.043

2. Tax1 enhances cancer cell proliferation via Ras-Raf-MEK-ERK signaling pathway.

Authors: Chunjiao Song; Weimin Wang; Meng Li; Yanxin Liu; Dexian Zheng
Journal: IUBMB Life Date: 2009-06 Impact factor: 3.885

3. High constant incidence in twins and other relatives of women with breast cancer.

Authors: J Peto; T M Mack
Journal: Nat Genet Date: 2000-12 Impact factor: 38.330

4. Comparison of BRCA1 polymorphisms, rare sequence variants and/or missense mutations in unaffected and breast/ovarian cancer populations.

Authors: F Durocher; D Shattuck-Eidens; M McClure; F Labrie; M H Skolnick; D E Goldgar; J Simard
Journal: Hum Mol Genet Date: 1996-06 Impact factor: 6.150

5. Identification of the breast cancer susceptibility gene BRCA2.

Authors: R Wooster; G Bignell; J Lancaster; S Swift; S Seal; J Mangion; N Collins; S Gregory; C Gumbs; G Micklem
Journal: Nature Date: 1995 Dec 21-28 Impact factor: 49.962

6. E-cadherin germline mutations in familial gastric cancer.

Authors: P Guilford; J Hopkins; J Harraway; M McLeod; N McLeod; P Harawira; H Taite; R Scoular; A Miller; A E Reeve
Journal: Nature Date: 1998-03-26 Impact factor: 49.962

7. Low-penetrance susceptibility to breast cancer due to CHEK2(*)1100delC in noncarriers of BRCA1 or BRCA2 mutations.

Authors: Hanne Meijers-Heijboer; Ans van den Ouweland; Jan Klijn; Marijke Wasielewski; Anja de Snoo; Rogier Oldenburg; Antoinette Hollestelle; Mark Houben; Ellen Crepin; Monique van Veghel-Plandsoen; Fons Elstrodt; Cornelia van Duijn; Carina Bartels; Carel Meijers; Mieke Schutte; Lesley McGuffog; Deborah Thompson; Douglas Easton; Nayanta Sodha; Sheila Seal; Rita Barfoot; Jon Mangion; Jenny Chang-Claude; Diana Eccles; Rosalind Eeles; D Gareth Evans; Richard Houlston; Victoria Murday; Steven Narod; Tamara Peretz; Julian Peto; Catherine Phelan; Hong Xiang Zhang; Csilla Szabo; Peter Devilee; David Goldgar; P Andrew Futreal; Katherine L Nathanson; Barbara Weber; Nazneen Rahman; Michael R Stratton
Journal: Nat Genet Date: 2002-04-22 Impact factor: 38.330

8. A comprehensive resource of interacting protein regions for refining human transcription factor networks.

Authors: Etsuko Miyamoto-Sato; Shigeo Fujimori; Masamichi Ishizaka; Naoya Hirai; Kazuyo Masuoka; Rintaro Saito; Yosuke Ozawa; Katsuya Hino; Takanori Washio; Masaru Tomita; Tatsuhiro Yamashita; Tomohiro Oshikubo; Hidetoshi Akasaka; Jun Sugiyama; Yasuo Matsumoto; Hiroshi Yanagawa
Journal: PLoS One Date: 2010-02-24 Impact factor: 3.240

9. RNA sequencing of cancer reveals novel splicing alterations.

Authors: Jeyanthy Eswaran; Anelia Horvath; Sucheta Godbole; Sirigiri Divijendra Reddy; Prakriti Mudvari; Kazufumi Ohshiro; Dinesh Cyanam; Sujit Nair; Suzanne A W Fuqua; Kornelia Polyak; Liliana D Florea; Rakesh Kumar
Journal: Sci Rep Date: 2013 Impact factor: 4.379

10. TopHat: discovering splice junctions with RNA-Seq.

Authors: Cole Trapnell; Lior Pachter; Steven L Salzberg
Journal: Bioinformatics Date: 2009-03-16 Impact factor: 6.937

30 in total

Review 1. Emerging Role of Genomics and Cell-Free DNA in Breast Cancer.

Authors: Lorenzo Gerratana; Andrew A Davis; Ami N Shah; Chenyu Lin; Carla Corvaja; Massimo Cristofanilli
Journal: Curr Treat Options Oncol Date: 2019-06-29

2. Extraction of Molecular Features through Exome to Transcriptome Alignment.

Authors: Prakriti Mudvari; Kamran Kowsari; Charles Cole; Raja Mazumder; Anelia Horvath
Journal: J Metabolomics Syst Biol Date: 2013-08-22

3. The association of the expression of miR-122-5p and its target ADAM10 with human breast cancer.

Authors: Sercan Ergün; Mustafa Ulasli; Yusuf Ziya Igci; Mehri Igci; Sevil Kırkbes; Ersin Borazan; Ahmet Balik; Önder Yumrutaş; Celalettin Camci; Ecir Ali Cakmak; Ahmet Arslan; Serdar Oztuzcu
Journal: Mol Biol Rep Date: 2014-10-16 Impact factor: 2.316

4. Computational Inferences of the Functions of Alternative/Noncanonical Splice Isoforms Specific to HER2+/ER-/PR- Breast Cancers, a Chromosome 17 C-HPP Study.

Authors: Rajasree Menon; Bharat Panwar; Ridvan Eksi; Celina Kleer; Yuanfang Guan; Gilbert S Omenn
Journal: J Proteome Res Date: 2015-07-23 Impact factor: 4.466

5. Global transcriptome and sequenome analysis of formalin-fixed salivary epithelial-myoepithelial carcinoma specimens.

Authors: Isabel Fonseca; Achim Bell; Khalida Wani; Diana Bell
Journal: Genes Chromosomes Cancer Date: 2014-12-29 Impact factor: 5.006

Review 6. Toxicogenomics and cancer susceptibility: advances with next-generation sequencing.

Authors: Baitang Ning; Zhenqiang Su; Nan Mei; Huixiao Hong; Helen Deng; Leming Shi; James C Fuscoe; William H Tolleson
Journal: J Environ Sci Health C Environ Carcinog Ecotoxicol Rev Date: 2014 Impact factor: 3.781

7. A novel FADS2 isoform identified in human milk fat globule suppresses FADS2 mediated Δ6-desaturation of omega-3 fatty acids.

Authors: Kumar S D Kothapalli; Hui Gyu Park; Xiaoxian Guo; Xuepeng Sun; James Zou; Stephanie S Hyon; Xia Qin; Peter Lawrence; Rinat R Ran-Ressler; Ji Yao Zhang; Zhenglong Gu; J Thomas Brenna
Journal: Prostaglandins Leukot Essent Fatty Acids Date: 2018-06-28 Impact factor: 4.006