| Literature DB >> 26474385 |
Andreas M Hoff1,2,3, Bjarne Johannessen1,2,3, Sharmini Alagaratnam1,2,3, Sen Zhao1,2,3, Torfinn Nome1,2,3, Marthe Løvf1,2,3, Anne C Bakken1,2,3, Merete Hektoen1,2,3, Anita Sveen1,2,3, Ragnhild A Lothe1,2,3, Rolf I Skotheim1,2,3.
Abstract
With an annual estimated incidence of 1.4 million, and a five-year survival rate of 60%, colorectal cancer (CRC) is a major clinical burden. To identify novel RNA variants in CRC, we analyzed exon-level microarray expression data from a cohort of 202 CRCs. We nominated 25 genes with increased expression of their 3' parts in at least one cancer sample each. To efficiently investigate underlying transcript structures, we developed an approach using rapid amplification of cDNA ends followed by high throughput sequencing (RACE-seq). RACE products from the targeted genes in 23 CRC samples were pooled together and sequenced. We identified VWA2-TCF7L2, DHX35-BPIFA2 and CASZ1-MASP2 as private fusion events, and novel transcript structures for 17 of the 23 other candidate genes. The high-throughput approach facilitated identification of CRC specific RNA variants. These include a recurrent read-through fusion transcript between KLK8 and KLK7, and a splice variant of S100A2. Both of these were overrepresented in CRC tissue and cell lines from external RNA-seq datasets.Entities:
Keywords: RACE-seq; colorectal cancer; fusion genes; splicing; transcript variants
Mesh:
Substances:
Year: 2015 PMID: 26474385 PMCID: PMC4742197 DOI: 10.18632/oncotarget.5500
Source DB: PubMed Journal: Oncotarget ISSN: 1949-2553
Top 25 genes with elevated 3′ expression in individual CRCs
| Gene symbol | Ensembl ID | Chromosome | Strand | Deviating sample | Expression Break score (EBS) | Type |
|---|---|---|---|---|---|---|
| ENSG00000132744 | 11 | −1 | Sample12_T | 2.60 | 5′ RACE candidate | |
| ENSG00000244617 | 2 | −1 | Sample5_T | 4.00 | 5′ RACE candidate | |
| ENSG00000136881 | 9 | −1 | Sample13_T | 3.67 | 5′ RACE candidate | |
| ENSG00000131050 | 20 | 1 | Sample16_T | 2.65 | 5′ RACE candidate | |
| ENSG00000131686 | 1 | 1 | Sample9_T | 4.16 | 5′ RACE candidate | |
| ENSG00000198756 | 1 | −1 | Sample19_T | 4.67 | 5′ RACE candidate | |
| ENSG00000164434 | 6 | 1 | Sample10_T | 5.67 | 5′ RACE candidate | |
| ENSG00000114279 | 3 | −1 | Sample12_T | 3.37 | 5′ RACE candidate | |
| ENSG00000152402 | 11 | −1 | Sample1_T | 3.68 | 5′ RACE candidate | |
| ENSG00000123407 | 12 | 1 | Sample15_T | 4.61 | 5′ RACE candidate | |
| ENSG00000095752 | 19 | −1 | Sample20_T | 3.47 | 5′ RACE candidate | |
| ENSG00000148798 | 10 | 1 | Sample14_T | 5.73 | 5′ RACE candidate | |
| ENSG00000169035 | 19 | −1 | Sample2_T | 3.91 | 5′ RACE candidate | |
| ENSG00000167916 | 17 | −1 | Sample5_T | 3.45 | 5′ RACE candidate | |
| ENSG00000167656 | 8 | −1 | Sample5_T | 3.15 | 5′ RACE candidate | |
| ENSG00000124466 | 19 | −1 | Sample5_T | 5.06 | 5′ RACE candidate | |
| ENSG00000009724 | 1 | −1 | Sample4_T | 3.25 | 5′ RACE candidate | |
| ENSG00000124003 | 2 | 1 | Sample6_T | 2.29 | 5′ RACE candidate | |
| ENSG00000169550 | 11 | −1 | Sample5_T | 3.81 | 5′ RACE candidate | |
| ENSG00000171840 | 12 | −1 | Sample8_T | 3.40 | 5′ RACE candidate | |
| ENSG00000120669 | 13 | −1 | Sample3_T | 2.75 | 5′ RACE candidate | |
| ENSG00000196754 | 1 | −1 | Sample5_T | 5.85 | 5′ RACE candidate | |
| ENSG00000112499 | 6 | −1 | Sample17_T | 4.33 | 5′ RACE candidate | |
| ENSG00000169507 | 2 | −1 | Sample19_T | 4.54 | 5′ RACE candidate | |
| ENSG00000132639 | 20 | 1 | Sample7_T | 3.08 | 5′ RACE candidate | |
| ENSG00000225292 | 10 | 1 | HCT116 | NA | Positive control | |
| ENSG00000148737 | 10 | 1 | NCI-H508 | NA | Positive control | |
| ENSG00000112299 | 6 | −1 | HT29 | NA | Positive control |
The top 25 genes with elevated 3′ expression, identified by analysis of exon microarray data, were selected for follow-up transcript characterization by 5′ RACE and sequencing. In addition to the 25 top-scoring genes, three genes with previously known transcriptional changes were included as positive controls.
Figure 1Pipeline to identify and characterize novel RNA variants in CRC
Analysis of genome-scale exon level microarray data revealed 25 candidate genes with overexpression of their 3′ parts, here exemplified with the BPIFA2 gene. The candidate genes were characterized with RACE-seq, a combination of 5′ RACE and deep sequencing. For the 25 candidate genes and also 3 positive control genes, nested RACE-primers were designed downstream of the suspected breakpoints (orange arrows). The resulting pools of RACE fragments (28 assays per sample) were prepared for sequencing with the Nextera XT protocol (Illumina), using tagmentation to simultaneously fragment and tag RACE-amplicons with adapters for sequencing. The fusion transcripts VWA2-TCF7L2, DHX35-BPIFA2 and CASZ1-MASP2, and also 55 transcript junctions were identified by two separate computational approaches from the RACE-seq data. These were probed for in external datasets from TCGA, CCLE and the Illumina body map.
Fusion transcripts detected by the RACE-seq approach
| Gene A | Gene B | Break position A | Break position B | Split reads | Spanning reads | In Frame | Probability score | Validated | Sample |
|---|---|---|---|---|---|---|---|---|---|
| Chr10: 114,220,341 | Chr10: 114,900,943 | 1234 | 163 | Yes | 0.96 | Previously reported | NCI-H508 | ||
| Chr10: 114,799,885 | Chr10: 114,648,494 | 114 | 41 | No | 0.95 | Previously reported | HCT-116 | ||
| Chr10: 116,014,807 | Chr10: 114,900,943 | 206 | 67 | Yes | 0.98 | Yes | Sample 17_T | ||
| Chr20: 37,632,550 | Chr20: 31,767,410 | 862 | 4 | Yes | 0.23 | Yes | Sample 16_T | ||
| Chr1: 10,820,757 | Chr1: 11,090,938 | 107 | NA | Yes | NA | Yes | Sample 4_T |
Both previously known as well as novel fusion transcripts were detected by the RACE-seq approach and subsequent analysis of paired-end sequencing reads. Only the top-scoring nominated candidates are shown for each fusion pair. The known as well as the novel fusion transcripts VWA2-TCF7L2 and DHX35-BPIFA2 were nominated by the fusion detection software, whereas the novel CASZ1-MASP2 was detected by the Unix utility grep and scaffold realignment.
The number of split reads and spanning reads supporting the fusion junction, determined from the fusion detection software or scaffold realignment for the CASZ1-MASP2 fusion (split reads contain the fusion boundary in the read itself, whereas spanning reads are paired ends that harbor the fusion boundary within the insert sequence).
Probability score determined by the fusion detection software.
Two fusion genes involving TCF7L2 have previously been reported and served as positive control fusion targets for the RACE-seq. The novel fusions were validated by Sanger sequencing.
Figure 2The novel fusion transcript VWA2-TCF7L2
A. As indicated by red arrows, a RACE assay with first and second round primers targeting exon 7 and 6 of TCF7L2 respectively was included in the sequencing setup as a positive control. In addition to enabling identification of previously known fusions involving TCF7L2, a new fusion was discovered that connects exon 4 of VWA2 to exon 6 of TCF7L2. The exon numbers refer to transcript structures ENST00000392982 for VWA2 and ENST00000369395 for TCF7L2. The 5′ partner gene, VWA2, is located 1.1 Mbp downstream of TCF7L2. The four exons upstream of the fusion breakpoint in VWA2 show high read coverage, measured in reads per kilobase exon sequence (RPK). On the contrary, the exons after the breakpoint in VWA2 were not picked up by the assay, indicating that the RACE assay targeting TCF7L2 specifically amplified the upstream VWA2 part in the tumor sample harboring the VWA2-TCF7L2 fusion. B. Normalized read counts mapping to the VWA2 gene for all samples included in the RACE-seq experimental set up. Only the sample with the identified VWA2-TCF7L2 fusion transcript had sequencing reads covering the VWA2 genes.
Figure 3Novel fusion between DHX35 and BPIFA2 is in concordance with 3′ overexpression of BPIFA2
A. The exon expression profile of BPIFA2 shows that one CRC sample has a 3-fold increase in expression of the 3′ part of the gene compared to the median of the cohort. The last four probe sets showed increased intensity levels and target exon 7, 8 and 9 of BPIFA2. B. Normalized read counts mapping to DHX35 were high only in sample16_T, the same sample which exhibits increased 3′ expression of BPIFA2. C. From RACE-seq we identified a fusion between exon 11 of DHX35 and exon 7 of BPIFA2. Exons 1–11 of DHX35 show high read coverage, measured in RPK. Exons 12–22, located downstream of the breakpoint in DHX35, were not covered by any reads, indicating that the RACE assay that targets BPIFA2 specifically amplifies the upstream DHX35 part of the fusion. By realigning sequencing reads from the sample in question to a fusion scaffold, we found that 207 unique split reads align across the fusion boundary.
Figure 4Novel fusion between CASZ1 and MASP2 is in concordance with 3′ overexpression of MASP2
A. The exon expression profile of MASP2 shows that one CRC sample has a 3- to 5-fold increase in expression of the 3′ part of the gene compared to the median of the cohort. The last six probe sets showed increased intensity and targeted exon 9, 10 and 11 of MASP2. B. Normalized read counts mapping to CASZ1 were high only in sample 4_T, the same sample which exhibit increased 3′ expression of MASP2. C. From RACE-seq we identified that the gene CASZ1 had high read coverage of its first two exons in the same sample that exhibited the deviating exon expression profile for MASP2. No fusions involving MASP2 were nominated by deFuse, but when using the grep command for parts of the MASP2 NGSP sequence (red arrow), we identified several reads containing a fusion boundary between CASZ1 exon 2 and MASP2 exon 9. Upon realigning sequencing reads, we found that 107 unique split reads aligns across the fusion boundary.
Figure 5Read-through from upstream KLK8 to KLK7 in samples with deviating 3′ expression of KLK7
A. The exon expression profile of KLK7 shows that several samples have increased expression of probe sets targeting exon 3 to 6. The top two samples (2_T & 5_T) were selected for RACE-seq to identify underlying transcript structural changes. B. A sashimi plot from IGV shows the alignment of sequencing reads from the two nominated samples for KLK7. The height of the bars represents the number of aligned reads, while arcs represent junctions connected to exon 3 of KLK7 and the coverage of these junctions, as determined by Tophat2 alignment and the sashimi plot package, are shown as numbers on the arcs.
Validation frequencies of KLK8-KLK7 and S100A2 junctions
| Gene | Chromosome | Pos 1 | Pos 2 | Distance | RACE-Seq | TCGA_tumor | TCGA_normal | CCLE | Body map |
|---|---|---|---|---|---|---|---|---|---|
| 19 | 51485170 | 51504353 | 19183 | 10/22 (45%) | 6/24 (25%) | 0/24 | 15/56 (27%) | 1/16 | |
| 1 | 153536357 | 153537981 | 1624 | 16/22 (73%) | 14/24 (58%) | 0/24 | 33/56 (59%) | 1/16 |
The novel junctions covering the read-through of KLK8 and KLK7 and the alternative 3′ splice site of S100A2 were discovered from the RACE-seq data. The junctions were validated by aligning all RACE-seq sequences and external data sets from the TCGA and the CCLE.
The number of reads for the junctions to be considered positive were >10, >=2 and >= 1 for the RACE-seq, CCLE and TCGA tumor/normal data sets, respectively.
From the Illumina Human body map data set, one read covering the KLK8-KLK7 read-through was identified in normal breast tissue, and 12 reads were found to cover the S100A2 alternative splice site in normal lung tissue.
Figure 6A novel 3′ splice-site in S100A2 is found to be overrepresented in CRC
A. Two samples had increased expression levels of probe sets targeting exons 2 and 3 of S100A2 (UCSC annotation). Ensembl release 75 has annotated additional transcript variants (in red; ENST00000368710, ENST00000368709) and upstream 5′ UTR exons that are not targeted by exon microarray probe sets. B. We identified the use of a novel 3′ splice-site when splicing exon 1 to exon 2. The splice-site is located 6 bp downstream of the canonical splice-site. Use of this alternative splice-site was found to be overrepresented in CRC samples (see Table 3). The splice-site occurs in the 5′ UTR (coding sequence in red), and according to our analysis does not alter the annotated ORF.