Overexpressed somatic alleles are enriched in functional elements in Breast Cancer.

Paula Restrepo, Mercedeh Movassagh, Nawaf Alomran, Christian Miller, Muzi Li, Chris Trenkov, Yulian Manchev, Sonali Bahl, Stephanie Warnken, Liam Spurr, Tatiyana Apanasovich, Keith Crandall, Nathan Edwards, Anelia Horvath.

Abstract

Asymmetric allele content in the transcriptome can be indicative of functional and selective features of the underlying genetic variants. Yet, imbalanced alleles, especially from diploid genome regions, are poorly explored in cancer. Here we systematically quantify and integrate the variant allele fraction from corresponding RNA and DNA sequence data from patients with breast cancer acquired through The Cancer Genome Atlas (TCGA). We test for correlation between allele prevalence and functionality in known cancer-implicated genes from the Cancer Gene Census (CGC). We document significant allele-preferential expression of functional variants in CGC genes and across the entire dataset. Notably, we find frequent allele-specific overexpression of variants in tumor-suppressor genes. We also report a list of over-expressed variants from non-CGC genes. Overall, our analysis presents an integrated set of features of somatic allele expression and points to the vast information content of the asymmetric alleles in the cancer transcriptome.

Year:  2017        PMID: 28811643      PMCID: PMC5557904          DOI: 10.1038/s41598-017-08416-w

Source DB:  PubMed          Journal:  Sci Rep        ISSN: 2045-2322            Impact factor:   4.379


Introduction

The cancer phenotype is largely driven by somatic mutations, whose carcinogenic effects are ultimately mediated by the transcription process[1-3]. As a mediator between genotype and phenotype, the tumor transcriptome reflects both advantage-driven selective pressure and the direct effects of the mutations on the transcription process. Hence, the tumor transcriptome is highly informative about somatic functionality, especially through allele-specific approaches that can confine expressed structures to particular mutant alleles[1-4]. Several studies have explored the allele-specific transcriptional landscape of cancer[1, 5–10]. Preferentially expressed alleles are reported to play a role in epithelial ovarian cancer[7], as well as in microRNA-implicated carcinogenesis, an example of which is miR-31 dysregulation in lung cancer[8]. Imbalanced allele expression can be caused by both large chromosomal alterations, such as copy number alterations (CNAs), and single nucleotide somatic mutations[1]. Single nucleotide somatic mutations can affect the transcriptome by altering regulatory, splicing, or expression-rate modifying sites. Such effects commonly manifest in cis and directly impact the transcript abundance of the mutation-bearing allele[1, 11, 12]. Mutations can also indirectly imbalance the allele content by changing protein function to either advance or impair tumor growth. Functional mutations that provide selective advantage are referred to as drivers, and they are commonly targeted by either positive or negative selection forces that retain or deplete the growth-affecting allele[13-16]. Accordingly, somatic allele imbalance, including the extremes of loss or over-expression, can indicate tumorigenic functionality. Expression imbalance of point mutations is particularly informative for regions with no CNAs, where potential effects on transcription can be directly linked to the underlying nucleotide change[14].
Therefore, quantitative integration of allele signals between same-source DNA and RNA is instrumental for tracking chromosome-of-origin effects. The latter, in turn, can be used to search for new genes whose allele behavior follows the pattern of known cancer drivers and is thus indicative of potential carcinogenicity. Accordingly, the few studies that quantitatively integrate allele abundance from matching DNA and RNA sequencing sources are very informative[10]. Herein, we apply a software tool that we recently developed, RNA2DNAlign[9], to systematically quantify the allele expression of somatic variants in breast cancer samples from The Cancer Genome Atlas (TCGA). RNA2DNAlign counts variant and reference sequencing reads derived from compatible RNA and DNA datasets, and tests for allelic imbalance; it also calls positions with extreme allele distributions, including somatic over-expression (SOM-E) or loss (SOM-L). We compute and compare the somatic variant allele fraction (VAF) of mutations in genes from the Cancer Gene Census (CGC)[17] to those in the rest of the genes in our samples. We also report a list of non-CGC genes with over-expressed somatic variants. Overall, we present an integrated set of somatic allele-specific expression features, in the context of their potential underlying functionality.

Results

Strategy

Our strategy was to first systematically quantify the variant allele fraction of the tumor RNA (VAF{tRNA}), and then to assess the correlation between RNA allele asymmetry and functional features (Fig. 1). Somatic variants (SOM) with a bi-allelic signal in the tumor DNA and a mono-allelic signal in the tumor RNA were classified as SOM-L (VAF{tRNA} ~ 0) or SOM-E (VAF{tRNA} ~ 1; Fig. 2). We assessed VAF{tRNA} both in absolute terms and relative to VAF{tDNA}, for which we introduce the expression VR:D = VAF{tRNA}:VAF{tDNA}. We note that by accounting for VAF{tDNA}, VR:D reflects the overall genome composition of the sample, including the contribution from large rearrangements and admixture with non-tumor genomes (i.e. the sample purity). First, we analyzed the allele distribution for mutations in known oncogenes and tumor suppressors from CGC. We evaluated VAF{tRNA} and VR:D for correlation with functional features including conservation, predicted pathogenicity, and location in critical sequence motifs. Next, we assessed these features, in the context of their allelic expression, in the non-CGC dataset, and highlighted variants whose somatic allele patterns follow the functionality-associated allele behavior of known cancer drivers.
Figure 1

Major steps of the analysis of allele distribution for somatic variants in our dataset. VR:D was analyzed for correlation with different functional mutations groups in oncogenes, tumor suppressors, and the rest of the genes. SOM-E and SOM-L variants were compared with the rest of the somatic mutations for predicted pathogenicity and location in functional motifs such as transcription and splice factor binding sites, and highly preserved sequences.

Figure 2

IGV visualization of somatic mutations that are over-expressed (SOM-E, middle) or under-expressed (SOM-L, right) compared to expected allele distribution for a germline heterozygote variant (left); the heterozygosity is reflected through color-coding of the summary flag on the top of each panel. The gray lines represent reads, and the colored letters show differences from the reference.


Overall dataset characteristics

A total of 1238 (1139 unique) mutations in 921 genes, of which 68 were listed in CGC, satisfied the requirements for our analysis (Supplementary Table 1 and Supplementary Figure 1). Between 7 and 51 somatic point mutations in expressed coding regions were assessed per individual sample. Most of the mutations (94%) were singletons (present in only one sample), whereas 44 mutations were seen in 2 samples, 12 in 3, 4 in 4, 2 in 5, and one mutation each was found in 6 and 7 different samples. Notably, all non-singleton mutations shared similar allele expression status across the different samples. A total of 437 somatic mutations (38.3%) were not expressed at all in the transcriptome (SOM-L), and 73 mutations (4.9%) were over-expressed (SOM-E). The analysis of the variant allele fraction showed an overall positive correlation between VAF{tDNA} and VAF{tRNA} (Spearman correlation r = 0.38, Fig. 3A–C). The functional distribution of the predicted consequences on the protein, and its intersection with allele-expression status, are presented in Fig. 3D. The missense, non-coding and stop-codon variants showed clearly different patterns of VR:D, with a higher VR:D in the missense mutations than in the non-coding and stop-codon variants (p = 0.0004, Kruskal-Wallis test[18], Fig. 3E). Notably, we observed a shift towards higher VR:D for the variants predicted to be pathogenic by FATHMM (Functional Analysis Through Hidden Markov Models; Fig. 3F)[19, 20].
Figure 3

(A–C) Distribution of VAF{tRNA} (blue) and VR:D (red) in the subgroups of missense (A), non-coding (B) and stop-codon (C) variants. The X axis shows the number of variants in each functional category. Positive correlation is seen in all three mutation groups. (D) Distribution of SOM-E and SOM-L expression status with regard to the predicted effect on protein function in the entire set, CGC, and non-CGC variants. (E) VR:D for non-coding, missense and stop-codon variants across the entire dataset. A clearly different VR:D distribution is seen among the functional subtypes, with the missense mutations showing higher VR:D, indicative of higher allele expression of potentially functional transcripts. (F) VR:D for pathogenic and neutral variants as predicted by FATHMM. The difference in the distribution is due to the larger proportion of pathogenic mutations with higher VR:D.
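
The Spearman coefficient reported for VAF{tDNA} versus VAF{tRNA} is a Pearson correlation computed on ranks. The following is a minimal, standard-library sketch of that calculation, not the authors' code; the function names are hypothetical and the inputs in the usage note are made-up numbers.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # average rank of the tie group
        for j in range(start, end + 1):
            ranks[order[j]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / (sx * sy)
```

For example, `spearman(vaf_tdna, vaf_trna)` on two equally long lists of per-variant fractions returns a value between -1 and 1; it does not compute a p-value.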


CGC genes somatic allele expression: overall features

The 68 known cancer driver genes collectively contained 103 (88 unique) somatic mutations qualifying for the analysis (Supplementary Table 2)[17]. Mutations in PIK3CA, MITF, ACVR2A, CLIP1, and TCEA1 were called in more than one sample. In this gene set, we called 10 SOM-E variants: seven missense substitutions, two synonymous variants, and, notably, the stop-codon R63X in CDH1. Of note, four of the SOM-E missense substitutions were called in TP53 (see Supplementary Table 2). A higher number of variants (25) were SOM-L, completely absent from the transcriptome in the CGC dataset. Several notable observations were made in the CGC subset. First, a different VR:D distribution was observed in the CGC variants as compared to the rest of the dataset (p = 0.02, Kruskal-Wallis test[18], Fig. 4A); the difference was due to a larger proportion of CGC variants with higher allele expression. Second, the CGC missense mutations showed higher allele expression than the missense mutations in the entire dataset (p = 0.03, Kruskal-Wallis test[18], Fig. 4B). Notably, a tendency for higher VR:D was also seen for the stop-codon mutations, albeit not reaching statistical significance (Fig. 4C). In contrast, the non-coding variants did not show significant differences between the CGC and non-CGC genes (Fig. 4D). Third, we documented a positive correlation between VR:D and predicted pathogenicity assessed by the CADD score (Combined Annotation Dependent Depletion; Spearman r = 0.25)[21], the FATHMM score (Spearman r = 0.17)[19, 20], and conservation of the position of the somatic mutation as assessed through GERP (Genomic Evolutionary Rate Profiling; Spearman r = 0.29)[22-26]. Of note, 21% of the variants in the CGC dataset modeled through FATHMM as pathogenic have been reported in cancer-based studies[17]. Collectively, the above analyses supported preferential expression of functional alleles in the CGC dataset.
Figure 4

VR:D in the CGC vs non-CGC genes (A), in missense variants (B), in stop-codon variants (C), and in non-coding variants (D).
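
The group comparisons of VR:D rely on the Kruskal-Wallis rank sum test. The sketch below shows how the H statistic and its chi-square approximation can be computed with the standard library only. It is illustrative rather than the authors' implementation; the function names are hypothetical, and the chi-square approximation is known to be rough for very small groups.

```python
from math import erfc, exp, gamma, sqrt


def average_ranks(values):
    """1-based ranks over the pooled data, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for j in range(start, end + 1):
            ranks[order[j]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def chi2_sf(x, df):
    """Upper tail of the chi-square distribution for a positive integer df."""
    if x <= 0:
        return 1.0
    t = x / 2
    if df % 2 == 0:
        a, q = 1.0, exp(-t)        # Q(1, t)
    else:
        a, q = 0.5, erfc(sqrt(t))  # Q(1/2, t)
    # Recurrence: Q(a + 1, t) = Q(a, t) + t**a * exp(-t) / gamma(a + 1)
    while a < df / 2:
        q += t ** a * exp(-t) / gamma(a + 1)
        a += 1
    return q


def kruskal_wallis(*groups):
    """Return (H, p) with tie correction and the chi-square approximation."""
    if len(groups) < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum * rank_sum / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical")
    h /= correction
    return h, chi2_sf(h, len(groups) - 1)
```

A call such as `kruskal_wallis(vrd_cgc, vrd_non_cgc)` takes one list of VR:D values per group; the variable names here are placeholders.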

We then assessed CGC SOM-E and SOM-L mutations in the context of their harboring gene’s function and mechanism of action. The first notable observation was a tendency for over-representation of genes acting in a recessive molecular mode among the SOM-E variants, as opposed to a more frequent dominant mode of action in the genes bearing SOM-L variants (p = 0.15). The recessive mode is traditionally more often associated with tumor-suppressive function, while dominant action is reported frequently for oncogenes[27]. In our study, SOM-E status appears not to result from a genomic DNA loss, as evidenced by the tumor DNA’s biallelic signal (0 < VAF{tDNA} < 1). Both the inhibition of the reference allele’s expression and the enhancement of the mutant allele’s expression could result in mutant RNA dominance, and these effects could be independent of, or related to, the functionality of the particular mutation. In the case of the mutations acknowledged as pathogenic in suppressor genes, the observed overexpression is consistent with mutation-driven allele inactivation, possibly favored by positive selection forces. Such an interpretation is also in line with the over-expressed stop-codon R63X in CDH1[28]. For the SOM-L mutations, whether their expression loss is linked to potential oncogenic action of the host gene remains to be determined on a per-gene basis. It is important to recognize that many somatic variants are randomly lost in the tumor transcriptome, and the number of transcribed ones can depend on factors such as Estrogen Receptor (ER) expression levels[1]. While it is possible for a SOM-L variant to reside on a lost allele by coincidence, this is unlikely to explain all SOM-L patterns for variants with known pathogenicity.

Allele expression of somatic mutations in the non-CGC genes

The integrated features of somatic allele expression in the non-CGC genes are presented in Supplementary Table 3. Consistent with the CGC dataset, we documented a positive correlation between increased allele expression and predicted pathogenicity and conservation scores (Spearman CADD r = 0.11, FATHMM r = 0.12, and GERP r = 0.17; Supplementary Table 3). The non-CGC somatic mutations with strong overexpression of the mutation-bearing allele (VAF{tRNA} = 1) are presented in Table 1. We next assessed the SOM-E variants for location within transcription factor binding sites (TFBS) and splicing factor binding sites (SFBS), including analysis for generation of a new binding site outside of known protein-recognizable sequences[29]. Indeed, 18 out of the 42 non-CGC SOM-E variants positioned outside an existing TFBS were predicted to generate a new motif recognizable by either a transcription or a splicing factor[29, 30].
Table 1

SOM-E mutations in non-CGC genes: location within transcription and splicing factor recognizable motifs.

| Gene | Chr:pos (hg38) | Function | TFBS | SFBS |
|---|---|---|---|---|
| TMEM51 | chr1:15215414 C > A | missense | none | none |
| NBPF3 | chr1:21481730 T > C | non-coding | none | none to SRp40 |
| EPHA10 | chr1:37720517 C > T | missense | none | none |
| KIF26B | chr1:245609349 C > G | missense | none to V$LRH1_Q5_01 | none |
| ILDR1 | chr3:122001432 G > A | non-coding | V$PPARG_02 | none to Sam68, SLM-2 |
| MUC20 | chr3:195725818 C > T | non-coding | V$CREB1_Q6 | hnRNP DL, SRp55 to none |
| ZNF518B | chr4:10445288 C > G | missense | V$PBX1_02 | none |
| BBS7 | chr4:121828063 C > G | non-coding | none | hnRNP, HuB, MBNL1 to TIA-1 |
| OTUD4 | chr4:145146395 G > A | missense | none | SRp40 to hnRNPA1 |
| SH3RF1 | chr4:169136534 G > A | non-coding | none | MBNL1 to SRp40 |
| SORBS2 | chr4:185589715 C > T | missense | none | YB-1 to SAM68 |
| MYO10 | chr5:16877688 C > G | missense | V$YY1_01 | none |
| MSH3 | chr5:80768937 T > A | missense | V$STAT3_01 | none |
| PCDHB5 | chr5:141136316 C > T | non-coding | none | none |
| GRPEL2 | chr5:149351223 G > A | missense | V$YY1_02 | none |
| TCOF1 | chr5:150376236 C > T | missense | none | SRp20/Nova-1/Nova-2 to none |
| MDN1 | chr6:89700782 A > T | non-coding | V$SMAD4_Q6_01 | none |
| TNRC18 | chr7:5316065 C > A | non-coding | none | none |
| WDR60 | chr7:158871385 A > G | missense | none | none to SC35, SF2/ASF, hnRNPA1 |
| FZD3 | chr8:28527405 G > A | non-coding | none | none |
| DAPK1 | chr9:87706999 C > T | missense | V$NFAT_Q6 | none |
| COL27A1 | chr9:114309301 C > G | missense | none to V$MYOGENIN_Q6_01 | none |
| PLCE1 | chr10:94270600 A > C | missense | none to V$NFAT1_Q4 | SF2/ASF, hnRNPA1 to none |
| PDCD11 | chr10:103441838 A > C | missense | none | YB-1 to SRp-40 |
| MUC6 | chr11:1016406 G > A | missense | none to V$NFAT1_Q4 | none |
| ACER3 | chr11:76861031 G > T | missense | none | SRp30c to none |
| RAB38 | chr11:88175236 A > T | missense | V$PPARG_02 | none |
| PHLDB1 | chr11:118627958 C > T | missense | V$IK3_01 | none to HuB, TIA-1, SRp40 |
| WNK1 | chr12:753666 C > G | missense | V$GFI1_01 | none |
| NFE2 | chr12:54292991 G > A | missense | none to V$BEN_01 | none to YB-1, SRp40 |
| NUAK1 | chr12:106067839 A > T | missense | V$OCT1_06 | none |
| RASAL1 | chr12:113114816 C > G | missense | V$YY1_01 | none |
| SLITRK6 | chr13:85795773 C > A | missense | V$SMAD4_Q6_01 | SF2/ASF, SRp38, YB-1 to Sam68 |
| ATP11A | chr13:112858175 C > A | missense | V$PAX5_01 | none |
| NYNRIN | chr14:24411385 C > G | non-coding | none to V$BEN_01 | MBNL1 |
| CLMN | chr14:95203587 C > T | missense | none | none to hnRNPI |
| AHNAK2 | chr14:104948892 T > C | missense | none | none |
| RAD51 | chr15:40706209 C > A | non-coding | V$CEBPB_02 | none |
| CCNB2 | chr15:59125011 G > A | non-coding | none | none |
| SULT1A2 | chr16:28592021 A > G | non-coding | none | SRp30c to none |
| NFATC3 | chr16:68190983 G > A | missense | none to V$GATA_Q6 | none to SLM-2, Sam68 |
| MED31 | chr17:6651601 A > G | non-coding | none | SRp30c to none |
| CHRNB1 | chr17:7447082 C > T | non-coding | none | none to ETR-3 |
| ACBD4 | chr17:45136583 C > T | missense | none | SRp55 to SC35 |
| ABCA7 | chr19:1041510 G > A | missense | none | none to YB-1, SRp20 |
| LMNB2 | chr19:2431813 G > A | non-coding | none | SRp55 to SC35 |
| ZNF676 | chr19:22180184 G > T | non-coding | none to V$NFAT1_Q4 | deleted MBNL1 |
| ZIM2 | chr19:56774836 G > T | stop | none to V$DRI1_01 | none to Sam68, SLM-2 |
| MRPL30 | chr2:99181122 C > A | non-coding | none to V$NFAT1_Q4 | SLM-2 to hnRNP, DAZAP1, HuD |
| PASK | chr2:241126376 C > G | missense | none | ETR-3 to SF2/ASF |
| TOP3B | chr22:21964200 A > T | non-coding | none | hnRNPH1, hnRNPH2 to none |
| GGA1 | chr22:37620258 G > A | synonymous | none | ETR-3, SRp30c to hnRNPH1/2 |
| RIBC2 | chr22:45426055 G > A | non-coding | none | hnRNP K to SF2/ASF |
| GRPR | chrX:16123978 C > G | missense | none | none |
| TBC1D25 | chrX:48560553 C > G | missense | none | none |
| IGBP1 | chrX:70133976 C > T | missense | none | MBNL1 to SRp40, SRp55 |
| HTATSF1 | chrX:136510164 G > C | missense | none | none to SRp20, YB-1 |

Next, we reviewed, on a per-gene basis, the current knowledge on the SOM-E genes and their possible implications in cancer. Despite not being listed in the CGC, some of these genes, such as MSH3, NUAK1 and NFE2, have been repeatedly linked to cancer[31-33]. Notably, more of the SOM-E genes were linked to tumor suppressor features than to oncogenic features (p = 3.8e-4, Metacore), which we also observed in the CGC dataset[34, 35]. Another striking observation is that 6 of the genes with SOM-E variants (MSH3, RAD51, TCOF1, TP53BP1, CCNB2, and TOP3B) are directly implicated in DNA damage response and repair[36-39], which was also the top-enriched pathway in the SOM-E dataset (p = 0.05, Metacore). In contrast, the most represented pathway in the SOM-L group was the immune response (p = 0.05, Metacore). With regard to GO annotations, two differences were detected between the SOM-E and SOM-L groups (Supplementary Figure 2). First, SOM-E variants were more frequently located in genes encoding receptors and signal transducers, while a higher proportion of the SOM-L variants resided in structure-supportive genes. Second, with regard to biological processes, the SOM-E group was enriched in genes involved in response to stimuli.

Discussion

Ultimately, the accurate assessment of the expressed allele fraction is only possible in the context of the corresponding DNA allele content. Herein, we integrate matching RNA and DNA allele fractions from bi-allelic DNA regions to identify transcriptome-favored alleles. We focus specifically on somatic point mutations in breast cancer, which we assess for tumorigenic functionality that can underlie selective transcriptome preference. The first striking observation from our study is that transcriptome-preferred alleles are enriched in functional features, which are often predicted to alter the original protein function. This correlation was stronger in the group of genes traditionally acknowledged as tumor suppressors. Tumor suppressors are often lost during progression, and their loss is considered to contribute to tumor growth[40]. In our data we see a strong expression preference towards somatically mutated tumor suppressor transcripts, including those bearing a premature stop-codon. Increased allele expression can be caused directly by mutation-promoted cis transcription activation, by retention of the mutant allele in the transcriptome via positive selection, or by both. Both scenarios imply functionality and growth-supportive potential. Consistent with that, highly expressed somatic variants, including SOM-E, were more frequently located in highly conserved genomic sequences predicted to be functional. Taken together, these data are consistent with a gain-of-function mechanism favored by the tumor transcriptome. An active role of over-expressed variants is also supported by the selection for maintaining the expression of complete, translation-ready transcripts, suggesting a possible role of the altered or shortened proteins in tumor progression. Indeed, although once recognized as tumor suppressors, many of the genes in our SOM-E set, including TP53, are now acknowledged to play more complex roles that include oncogenic action[41-44].
Both inactivation and alteration of the protein function can be crucial for tumor development. Regardless of the mechanism of action, the above observations mark allelic overexpression as a highly informative metric that can be used to outline functionally enriched somatic datasets. The proportion of SOM-L alleles in our data is generally consistent with other reports[1]. Under-expressed alleles, including SOM-L, also correlated with functional annotations and regulatory motifs, though they did not reach the significance of SOM-E. In contrast to SOM-E, SOM-L variants carry features that imply intolerance of the transcriptional machinery to the harbored variant. In the absence of CNAs, several mechanisms could potentially lower the allele expression levels of mutation-bearing transcripts. A well-acknowledged scenario is the surveillance-driven targeting of transcripts with deleterious variants, the most prominent example of which is nonsense-mediated decay (NMD)[1, 45]. A degradation mechanism can also take place where the mutation results in an unstable RNA structure[46]. Finally, a mutation can destroy a binding site for a transcription or splicing factor, thus directly abolishing the expression of the underlying allele[14]. Additional factors, such as high ER expression levels, are also reported to correlate with a decreased number of expressed somatic mutations[1]. Besides the above mutation-focused mechanisms, SOM-L may result from random under-expression in the tumor transcriptome and the general infidelity of the cancer transcriptional machinery[47, 48]. The latter implies a higher contribution of randomness to SOM-L loci, which is likely to dilute functional annotations in this group. Another striking observation from our analysis is the expression pattern of stop-codon mutations. Several recent studies have reported decreased expression of stop-codon bearing variants in cancer, and have linked it to NMD[1, 49].
Notably, in our data we see stop-codon bearing alleles over-represented as compared to the reference. Whether these expressed RNAs are translated into shorter proteins is a subject for further studies, but this possibility is consistent with the presence of translation-ready transcripts containing a premature stop[1]. While NMD is known to be impaired in cancer, our data suggest gene-selective NMD actions[50-52]. Distinguishing pathogenic mutations from the more prevalent neutral variants constitutes one of the greatest challenges of cancer biology, leading to substantial effort towards developing confident analytic strategies. Modern methods integrate traditional frequency-based approaches with expression abundance, functional effects, interaction networks and pathway context[13, 53–60]. Here, we integrate somatic allele fraction with most of the above strategies and the knowledge on tumor-driving mechanisms, and evaluate the potential of asymmetric allele expression to predict cancer-implicated variants. We document distinct allele signatures of cancer drivers at several levels. First, mutations in known cancer genes from our dataset presented more frequently with extreme allele patterns. An example is TP53, mutations in which were frequently either over-expressed or lost. Second, mutations in known cancer-implicated genes presented with higher allele expression. This was also reflected in the higher percentage of SOM-E variants among the known cancer genes. Third, SOM-E mutation sites were enriched in conservation and functional motifs. Cumulatively, these findings highlight the SOM-E status as a potential indicator of cancer-driving functionality. Based on the above, we list the non-CGC genes whose expression status follows the driver-enriched SOM-E status (see Table 1); although not included in the CGC list, some of these genes have been linked to cancer before and are worth further investigation.
In summary, our research illustrates an important correlation between asymmetric alleles and cancer-implicated functionality, and functionality in general, and underscores the vast information content of our strategy to systematically outline asymmetrically expressed alleles. This strategy is applicable to all types of cancer and is now enabled by the growing accessibility of matched DNA and RNA sequence data and of new tools for their integration and analysis[9, 61, 62].

Methods

TCGA samples selection

We first identified all breast cancer samples for which the following five sets were available: normal exome, normal transcriptome, tumor exome, tumor transcriptome, and CNA data (segmentation file based on Affymetrix SNPv6 array profiling)[12, 60, 63, 64]. All these samples had purity assessed with at least three of the following five purity estimators: ESTIMATE, ABSOLUTE, LUMP, IHC and the Consensus Purity Estimation (CPE)[65-68]. From these, we excluded samples with an excessive number of somatic mutations (more than 3 standard deviations), possibly due to clustered genomic rearrangements[69, 70]. The remaining 72 samples (Supplementary Table 4) were retained for further analysis. We reviewed the pathology reports and retrieved the available clinical information; data for 41 (57%) of the studied samples were available (see Supplementary Table 4). The highest proportion of the samples were ductal adenocarcinomas, either ER-positive or ER/PR-positive. We did not observe any significantly distinguishing somatic expression patterns, which is likely due to the small sample size. The purity, as assessed by the above-mentioned algorithms, is shown in Supplementary Table 5.
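
The two sample-level criteria above (enough purity estimates, no extreme mutation burden) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the text does not say whether a sample or population standard deviation was used or whether the cutoff is one-sided, so the sketch assumes a sample standard deviation and drops only samples above the mean. All names are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, Set

MIN_PURITY_ESTIMATES = 3  # "at least three of the following five purity estimators"


def has_enough_purity_estimates(estimators: Set[str]) -> bool:
    """True if a sample has purity values from at least three estimators."""
    return len(estimators) >= MIN_PURITY_ESTIMATES


def drop_mutation_outliers(counts: Dict[str, int], n_sd: float = 3.0) -> Dict[str, int]:
    """Keep samples whose mutation count is at most mean + n_sd * SD.

    Assumption: sample standard deviation, upper tail only.
    """
    if len(counts) < 2:
        return dict(counts)
    mu = mean(counts.values())
    sd = stdev(counts.values())
    return {sample: c for sample, c in counts.items() if c <= mu + n_sd * sd}
```

Note that with very few samples no count can lie more than 3 standard deviations from the mean, so a filter of this kind only has an effect on reasonably large cohorts.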

Allele count computation

All datasets used were generated through paired-end sequencing on an Illumina HiSeq platform. The sequencing reads aligned to the human genome reference hg38 (Binary Alignment Map files, BAMs) were downloaded from the Genomic Data Commons Data Portal (https://portal.gdc.cancer.gov/) and processed downstream through an in-house pipeline. Briefly, for both DNA and RNA datasets, variants were called using the mpileup module of SAMtools[70]. The variants were further annotated through SeattleSeq 147 (http://snp.gs.washington.edu/SeattleSeqAnnotation147/). The alignments together with the variant calls (.vcf) were processed through RNA2DNAlign. RNA2DNAlign produced variant and reference sequencing read counts for all the variant positions in all four datasets (normal exome, normal transcriptome, tumor exome and tumor transcriptome). The read count assessments were visually examined using the Integrative Genomics Viewer[72]. We excluded from further analyses variants which (1) were covered by fewer than 10 sequencing reads in the tumor DNA or the RNA sequencing data; (2) resided in known imprinted regions; (3) resided in an area affected by copy number change in the corresponding sample, as defined by the CNA segmentation file; or (4) were present in the normal DNA or RNA, suggestive of germline origin.
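
The four exclusion criteria listed above amount to a per-variant filter. The sketch below is one way to express it; the `Candidate` record and its field names are hypothetical and stand in for whatever the in-house pipeline actually stores. Only the depth threshold of 10 reads comes from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_DEPTH = 10  # from the text: fewer than 10 covering reads is excluded


@dataclass
class Candidate:
    """Hypothetical per-variant record; the field names are illustrative."""
    name: str
    tumor_dna_depth: int
    tumor_rna_depth: int
    in_imprinted_region: bool
    in_cna_segment: bool
    seen_in_normal: bool


def exclusion_reason(v: Candidate) -> Optional[str]:
    """Return the first criterion the variant fails, or None if it is kept."""
    if v.tumor_dna_depth < MIN_DEPTH or v.tumor_rna_depth < MIN_DEPTH:
        return "low coverage"
    if v.in_imprinted_region:
        return "imprinted region"
    if v.in_cna_segment:
        return "copy number change"
    if v.seen_in_normal:
        return "present in normal (possible germline)"
    return None


def retained(variants: List[Candidate]) -> List[Candidate]:
    """Variants that pass all four criteria."""
    return [v for v in variants if exclusion_reason(v) is None]
```

Returning the reason rather than a bare boolean makes it easy to tabulate how many variants each criterion removes.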

Assessment for allele distribution

Allele expression rates within a sample were determined through estimation of the relative abundance of variant over total sequence read counts, expressed as Variant Allele Fraction (VAF). For each somatic mutation, we computed VAF = n(var)/(n(ref) + n(var)) for both tumor RNA (VAF{tRNA}) and tumor DNA (VAF{tDNA}), where n(var) and n(ref) are the counts of the variant and reference sequencing reads covering the position. To account for allele asymmetries related to DNA, we analyzed VAF{tRNA} in the context of the corresponding VAF{tDNA}. Over-expression of somatic mutations (SOM-E status) was determined as prevalence of variant sequencing reads in the transcriptome (VAF{tRNA} ~ 1), while SOM-L was defined by complete loss of the mutant allele in the transcriptome (VAF{tRNA} ~ 0). All the VAF{tRNA} values were used in a correlation analysis to search for association with functional features. Overall VAFs across the studied datasets were illustrated using Circos plots (see Supplementary Figure 1)[73].
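
As a worked illustration of the definitions above, the sketch below computes VAF from read counts and the ratio VR:D = VAF{tRNA}:VAF{tDNA}. The class and function names are hypothetical. The `naive_status` helper uses a simple tolerance `tol` to approximate "VAF{tRNA} ~ 1" and "VAF{tRNA} ~ 0"; that cutoff is an assumption made for illustration only, since the study calls SOM-E and SOM-L with a binomial test.

```python
from dataclasses import dataclass


@dataclass
class AlleleCounts:
    """Variant and reference read counts at one position in one dataset."""
    n_var: int
    n_ref: int

    def vaf(self) -> float:
        """VAF = n(var) / (n(ref) + n(var))."""
        total = self.n_var + self.n_ref
        if total == 0:
            raise ValueError("position has no covering reads")
        return self.n_var / total


def vr_d(rna: AlleleCounts, dna: AlleleCounts) -> float:
    """VR:D, the ratio of VAF{tRNA} to VAF{tDNA}."""
    dna_vaf = dna.vaf()
    if dna_vaf == 0:
        raise ValueError("variant is absent from the tumor DNA")
    return rna.vaf() / dna_vaf


def naive_status(rna: AlleleCounts, dna: AlleleCounts, tol: float = 0.05) -> str:
    """Illustrative labelling; `tol` is an assumed cutoff, not the paper's test."""
    if not 0 < dna.vaf() < 1:
        return "not bi-allelic in DNA"
    rna_vaf = rna.vaf()
    if rna_vaf >= 1 - tol:
        return "SOM-E"
    if rna_vaf <= tol:
        return "SOM-L"
    return "SOM"
```

For instance, 60 variant and 40 reference reads in the RNA against 30 variant and 70 reference reads in the DNA give VAF{tRNA} = 0.6, VAF{tDNA} = 0.3 and VR:D = 2.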

Functional and enrichment analyses

Functional annotations, conservation scores and modeled pathogenicity were extracted using the SeattleSeq annotation 147 (http://snp.gs.washington.edu/SeattleSeqAnnotation147/index.jsp). Pathogenicity was modeled using PolyPhen, CADD and FATHMM models, and conservation was assessed based on Phast, GERP and Grantham scores[20-26]. Gene Ontology categories, pathway enrichment and network analysis were assessed using Metacore (Clarivate Analytics). Transcription factor binding sites were analyzed using TRANSFAC 7.0[29] and splicing motifs were assessed using SpliceAid2[30].

Statistics

SOM, SOM-E and SOM-L variants were called based on a binomial test for variant and reference sequencing read distribution, as previously described[9]. The distributions of SOM-E and SOM-L across tumor suppressors, oncogenes, and the rest of the genes in the datasets, as well as the distribution of functional elements across SOM, SOM-E and SOM-L, were assessed using the Fisher exact test, Pearson chi-square test, Kruskal-Wallis rank sum test, linear regression analysis, and the Spearman rank correlation coefficient[18, 74, 75]. Yates’s correction for continuity was applied for tests with fewer than 5 measurements in any category[76]. The means of the VAF across different mutation types were compared using Student’s t-test[77]. P-values below 0.05 were considered significant. For multiple trials, the significance value was corrected using the Benjamini-Hochberg False Discovery Rate (FDR) technique.
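
Two of the procedures named in the Statistics section, a binomial test on read counts and Benjamini-Hochberg correction, can be sketched with the standard library as shown below. This is not RNA2DNAlign's implementation, which is described in [9]; in particular, the null proportion `p = 0.5` is an assumption chosen for illustration, and the function names are hypothetical.

```python
from math import exp, lgamma, log


def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability of k successes in n trials, computed in log space."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + k * log(p) + (n - k) * log(1 - p))


def binom_two_sided(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided p-value: total mass of outcomes no likelier than k.

    `k` is the variant read count and `n` the total depth; `p` is the
    assumed null proportion of variant reads.
    """
    if not 0 < p < 1 or not 0 <= k <= n:
        raise ValueError("need 0 < p < 1 and 0 <= k <= n")
    pmf = [binom_pmf(i, n, p) for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-7)  # small tolerance for floating-point ties
    return min(1.0, sum(x for x in pmf if x <= cutoff))


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A variant would then be reported when its adjusted p-value falls below the chosen FDR level, here 0.05 to match the significance threshold stated above.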

References (10 of 72 shown)

Review 1.  Nonsense-mediated RNA decay regulation by cellular stress: implications for tumorigenesis.

Authors:  Lawrence B Gardner
Journal:  Mol Cancer Res       Date:  2010-02-23       Impact factor: 5.852

2.  RNA2DNAlign: nucleotide resolution allele asymmetries through quantitative assessment of RNA and DNA paired sequencing data.

Authors:  Mercedeh Movassagh; Nawaf Alomran; Prakriti Mudvari; Merve Dede; Cem Dede; Kamran Kowsari; Paula Restrepo; Edmund Cauley; Sonali Bahl; Muzi Li; Wesley Waterhouse; Krasimira Tsaneva-Atanasova; Nathan Edwards; Anelia Horvath
Journal:  Nucleic Acids Res       Date:  2016-08-30       Impact factor: 16.971

Review 3.  Immune infiltration in human tumors: a prognostic factor that should not be ignored.

Authors:  F Pagès; J Galon; M-C Dieu-Nosjean; E Tartour; C Sautès-Fridman; W-H Fridman
Journal:  Oncogene       Date:  2009-11-30       Impact factor: 9.867

4.  Regulation of nonsense-mediated mRNA decay: implications for physiology and disease.

Authors:  Rachid Karam; Jordan Wengrod; Lawrence B Gardner; Miles F Wilkinson
Journal:  Biochim Biophys Acta       Date:  2013-03-13

5.  Adjusting for background mutation frequency biases improves the identification of cancer driver genes.

Authors:  Perry Evans; Stefan Avey; Yong Kong; Michael Krauthammer
Journal:  IEEE Trans Nanobioscience       Date:  2013-05-16       Impact factor: 2.935

6.  COSMIC: exploring the world's knowledge of somatic mutations in human cancer.

Authors:  Simon A Forbes; David Beare; Prasad Gunasekaran; Kenric Leung; Nidhi Bindal; Harry Boutselakis; Minjie Ding; Sally Bamford; Charlotte Cole; Sari Ward; Chai Yin Kok; Mingming Jia; Tisham De; Jon W Teague; Michael R Stratton; Ultan McDermott; Peter J Campbell
Journal:  Nucleic Acids Res       Date:  2014-10-29       Impact factor: 16.971

7.  Frequent mutations in acetylation and ubiquitination sites suggest novel driver mechanisms of cancer.

Authors:  Soumil Narayan; Gary D Bader; Jüri Reimand
Journal:  Genome Med       Date:  2016-05-12       Impact factor: 11.117

8.  Negative selection maintains transcription factor binding motifs in human cancer.

Authors:  Ilya E Vorontsov; Grigory Khimulya; Elena N Lukianova; Daria D Nikolaeva; Irina A Eliseeva; Ivan V Kulakovskiy; Vsevolod J Makeev
Journal:  BMC Genomics       Date:  2016-06-23       Impact factor: 3.969

9.  Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models.

Authors:  Hashem A Shihab; Julian Gough; David N Cooper; Peter D Stenson; Gary L A Barker; Keith J Edwards; Ian N M Day; Tom R Gaunt
Journal:  Hum Mutat       Date:  2012-11-02       Impact factor: 4.878

10.  MUFFINN: cancer gene discovery via network analysis of somatic mutation data.

Authors:  Ara Cho; Jung Eun Shim; Eiru Kim; Fran Supek; Ben Lehner; Insuk Lee
Journal:  Genome Biol       Date:  2016-06-23       Impact factor: 13.583
