Simon A Hardwick, Samuel D Bassett, Dominik Kaczorowski, James Blackburn, Kirston Barton, Nenad Bartonicek, Shaun L Carswell, Hagen U Tilgner, Clement Loy, Glenda Halliday, Tim R Mercer, Martin A Smith, John S Mattick.
Abstract
The human brain is one of the last frontiers of biomedical research. Genome-wide association studies (GWAS) have succeeded in identifying thousands of haplotype blocks associated with a range of neuropsychiatric traits, including disorders such as schizophrenia, Alzheimer's and Parkinson's disease. However, the majority of single nucleotide polymorphisms (SNPs) that mark these haplotype blocks fall within non-coding regions of the genome, hindering their functional validation. While some of these GWAS loci may contain cis-acting regulatory DNA elements such as enhancers, we hypothesized that many are also transcribed into non-coding RNAs that are missing from publicly available transcriptome annotations. Here, we use targeted RNA capture ('RNA CaptureSeq') in combination with nanopore long-read cDNA sequencing to transcriptionally profile 1,023 haplotype blocks across the genome containing non-coding GWAS SNPs associated with neuropsychiatric traits, using post-mortem human brain tissue from three neurologically healthy donors. We find that the majority (62%) of targeted haplotype blocks, including 13% of intergenic blocks, are transcribed into novel, multi-exonic RNAs, most of which are not yet recorded in GENCODE annotations. We validated our findings with short-read RNA-seq, providing orthogonal confirmation of novel splice junctions and enabling a quantitative assessment of the long-read assemblies. Many novel transcripts are supported by independent evidence of transcription including cap analysis of gene expression (CAGE) data and epigenetic marks, and some show signs of potential functional roles. We present these transcriptomes as a preliminary atlas of non-coding transcription in human brain that can be used to connect neurological phenotypes with gene expression.
Keywords: GWAS; RNA-seq; brain; haplotype blocks; long-read sequencing; non-coding RNA; sequins
Year: 2019; PMID: 31031799; PMCID: PMC6473190; DOI: 10.3389/fgene.2019.00309
Source DB: PubMed; Journal: Front Genet; ISSN: 1664-8021; Impact factor: 4.599
FIGURE 1. Schematic outline of experimental design. (A) First, haplotype blocks were predicted around 1,352 non-coding GWAS SNPs (colored circles) associated with neurological phenotypes. Blocks were defined by identifying all SNPs in linkage disequilibrium (LD) with the GWAS SNP (white circles). Of the 1,023 blocks, 780 overlap an annotated exon (GENCODE v24), 81 are located entirely within an annotated intron, and 162 are intergenic. (B) Biotinylated oligonucleotide probes (orange bars) were designed to tile across haplotype blocks (with annotated protein-coding exons and repeat elements omitted). (C) Probes are used as baits to capture any transcripts generated from targeted regions, followed by pull-down enrichment and subsequent transcriptional profiling with both long- and short-read RNA sequencing. (D) Sequence reads are aligned to the genome (hg38) and a hybrid transcriptome is assembled by leveraging the advantages of both long- and short-read RNA-seq.
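The three-way split of haplotype blocks in panel (A) can be illustrated with a simple interval check. The following is a minimal sketch, not the authors' pipeline: the `Interval` type, the function names, and the 0-based half-open coordinate convention are all assumptions made for illustration, and a real analysis would read exon and intron coordinates from a GENCODE annotation file.

```python
from typing import Iterable, NamedTuple


class Interval(NamedTuple):
    """Hypothetical genomic interval; 0-based start, exclusive end (assumed)."""
    chrom: str
    start: int
    end: int


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def contains(outer: Interval, inner: Interval) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return (outer.chrom == inner.chrom
            and outer.start <= inner.start
            and inner.end <= outer.end)


def classify_block(block: Interval,
                   exons: Iterable[Interval],
                   introns: Iterable[Interval]) -> str:
    """Assign a block to one of the three categories named in the caption."""
    if any(overlaps(block, exon) for exon in exons):
        return "exonic"
    if any(contains(intron, block) for intron in introns):
        return "intronic"
    return "intergenic"


# Toy annotation: two exons separated by one intron on chr1.
exons = [Interval("chr1", 100, 200), Interval("chr1", 500, 600)]
introns = [Interval("chr1", 200, 500)]

labels = [
    classify_block(Interval("chr1", 150, 250), exons, introns),
    classify_block(Interval("chr1", 250, 400), exons, introns),
    classify_block(Interval("chr1", 1000, 1100), exons, introns),
]
```

As a consistency check on the caption, the three reported categories account for every targeted block: 780 + 81 + 162 = 1,023.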
FIGURE 2. Validation of CaptureSeq design and ONT sequencing. (A) A subset of RNA sequins were targeted for capture as part of our CaptureSeq design (n = 25/78 genes; 49/164 isoforms). By plotting the measured abundance (TPM; y-axis) against the known input concentration (x-axis) for captured (red) and non-captured (green) sequin isoforms, we can compare the quantitative accuracy of captured vs. non-captured transcripts. Open circles indicate sequin isoforms that were not detected. Vertical dotted green line indicates the limit of detection (LoD) for non-captured transcripts. By comparing the difference between the measured abundance of captured and non-captured transcripts, we observe a ∼230-fold enrichment with CaptureSeq. Error bars represent standard deviation (SD) between the four replicate ONT samples. (B) Scatter plot shows the coefficient of variation (SD divided by mean) of each spike-in plotted against its respective input concentration, indicating the expression-dependent bias of RNA-seq.
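The two quantities in this caption, the coefficient of variation and the fold enrichment, can be sketched as follows. The caption defines the coefficient of variation as SD divided by mean but does not say how the fold enrichment was derived, so the median-of-ratios approach below is an assumption rather than the authors' method. The use of the sample SD (n - 1 denominator), the function names, and all numbers in the example are likewise hypothetical.

```python
from statistics import mean, median, stdev
from typing import Sequence, Tuple


def coefficient_of_variation(replicate_tpm: Sequence[float]) -> float:
    """SD divided by mean across replicates (sample SD is an assumption)."""
    m = mean(replicate_tpm)
    if m == 0:
        raise ValueError("mean abundance is zero; CV is undefined")
    return stdev(replicate_tpm) / m


def median_yield(isoforms: Sequence[Tuple[float, float]]) -> float:
    """Median of measured TPM per unit of input concentration.

    Each item is (input_concentration, measured_tpm); isoforms with zero
    input or zero measured abundance (not detected) are skipped.
    """
    ratios = [tpm / conc for conc, tpm in isoforms if conc > 0 and tpm > 0]
    if not ratios:
        raise ValueError("no detected isoforms to summarize")
    return median(ratios)


def fold_enrichment(captured: Sequence[Tuple[float, float]],
                    non_captured: Sequence[Tuple[float, float]]) -> float:
    """Ratio of median yields: one plausible estimator, assumed here."""
    return median_yield(captured) / median_yield(non_captured)


# Made-up values for illustration only.
cv = coefficient_of_variation([8.0, 10.0, 12.0, 10.0])
fold = fold_enrichment(
    captured=[(1.0, 200.0), (2.0, 400.0)],
    non_captured=[(1.0, 1.0), (4.0, 4.0)],
)
```

Taking medians rather than means keeps a single outlying isoform from dominating the estimate, which matters when abundances span several orders of magnitude.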
FIGURE 3. Transcriptional landscape of haplotype blocks associated with neuropsychiatric functions. (A) Bar charts show the number of transcripts, introns and internal exons contained in our filtered hybrid transcriptome (red) compared to existing annotations (four versions of GENCODE and MiTranscriptome v2). Only multi-exonic transcripts that overlap with targeted haplotype blocks are considered in this analysis. (B) Table shows the classification of transcripts in our hybrid transcriptome in relation to the latest GENCODE annotation (v29). (C) Cumulative frequency histograms show the coding potential of our transcripts, as assessed by CPAT. Colors refer to the categories defined in part (B). Vertical dotted line indicates the commonly used cut-off for human transcripts (0.364). GENCODE (v29) annotated lncRNAs (gray dotted line) and protein-coding genes (gray solid line) are also plotted for reference. (D) Box plots show the distribution of phastCons scores (Siepel et al., 2005) (vertebrate 100-way alignment) of our transcripts, colored by type. Box edges indicate lower and upper quartiles, center lines indicate median, notches indicate the 95% confidence interval around the median. (E) Density plots show the mean expression (log10 TPM) of transcripts across all four samples, as measured by Illumina short-read sequencing.
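The cut-off in panel (C) amounts to thresholding each transcript's CPAT coding probability. The sketch below assumes the probabilities have already been computed by CPAT and are available as plain floats. The function names and example values are hypothetical, and treating a score exactly at the cut-off as coding is an assumption about the boundary convention.

```python
from typing import Sequence

# Cut-off for human transcripts quoted in the caption.
CPAT_HUMAN_CUTOFF = 0.364


def label_coding_potential(probability: float,
                           cutoff: float = CPAT_HUMAN_CUTOFF) -> str:
    """Label a transcript from its coding probability (>= cutoff is coding)."""
    return "coding" if probability >= cutoff else "non-coding"


def fraction_non_coding(probabilities: Sequence[float],
                        cutoff: float = CPAT_HUMAN_CUTOFF) -> float:
    """Fraction of transcripts falling below the cut-off.

    This is the value a cumulative frequency curve would take just to the
    left of the vertical dotted line.
    """
    if not probabilities:
        raise ValueError("no transcripts supplied")
    below = sum(1 for p in probabilities if p < cutoff)
    return below / len(probabilities)


# Made-up coding probabilities for illustration only.
example_scores = [0.1, 0.2, 0.5, 0.9]
example_labels = [label_coding_potential(p) for p in example_scores]
example_fraction = fraction_non_coding(example_scores)
```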
FIGURE 4. Identification of novel intergenic transcripts. (A) Bar charts indicate the fraction of transcription start sites (TSSs) occupied by cap analysis of gene expression (CAGE) peaks (blue), as well as the three epigenetic marks that are typically associated with actively transcribed promoters: H3K4me1 (red), H3K4me3 (green), and H3K27ac (purple). GENCODE (v29) lncRNAs and protein-coding genes are also plotted for reference. (B) Bar charts show the enrichment of the promoter regions of transcripts for CAGE peaks. Odds ratio of enrichment is plotted for novel intergenic transcripts compared to lncRNAs and protein-coding genes. (C, upper) Genome browser view shows a novel intergenic locus identified overlapping a GWAS haplotype block (solid black bar, top) associated with smoking behavior (rs1847461) on chromosome 12. Transcripts from our filtered hybrid transcriptome are shown below, followed by spliced ONT sequencing coverage (red), spliced Illumina sequencing coverage (blue), PhyloP conservation track, and CAGE robust peaks. (C, lower) Two separate magnified views show novel TSSs supported by CAGE peaks and highly conserved promoter regions. ONT read alignments are also shown.
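An odds ratio of enrichment, as plotted in panel (B), compares the odds of a promoter carrying a CAGE peak in one set of regions against another. The caption does not specify the comparison set or any correction used, so the following is a generic sketch of the standard 2x2 odds ratio. The function name and all counts are hypothetical.

```python
def odds_ratio(hits_a: int, total_a: int, hits_b: int, total_b: int) -> float:
    """Odds ratio for set A relative to set B from a 2x2 contingency table.

    hits_*  : number of promoters overlapping a CAGE peak
    total_* : number of promoters examined in that set
    """
    misses_a = total_a - hits_a
    misses_b = total_b - hits_b
    if min(hits_a, misses_a, hits_b, misses_b) < 0:
        raise ValueError("hits cannot exceed totals or be negative")
    if misses_a == 0 or hits_b == 0:
        raise ZeroDivisionError("odds ratio undefined for these counts")
    # (hits_a / misses_a) / (hits_b / misses_b), rearranged to one division.
    return (hits_a * misses_b) / (misses_a * hits_b)


# Made-up counts: 30 of 100 promoters with a peak vs. 10 of 100 in the
# comparison set.
example_or = odds_ratio(30, 100, 10, 100)
```

A value above 1 indicates that promoters in set A are more likely to overlap a CAGE peak than those in set B; a value of 1 indicates no enrichment.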
FIGURE 5. Identification of novel internal coding exons in RIMS2 gene. (A) Genome browser view shows a ∼570 kb GWAS haplotype block (solid black bar, top) on chromosome 8 associated with cocaine dependence (rs75688122). CaptureSeq detected multiple novel splice isoforms of RIMS2, a gene involved in neurotransmitter release. (B) Magnified views show three novel, highly conserved exons detected in the first intron of RIMS2, which are predicted to collectively add 56 amino acids to the start of the RIMS2 protein. The first novel internal exon includes a start codon (left), while the second and third exons (right) are in-frame (33 and 87 bp, respectively).
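The in-frame claim in panel (B) can be checked with simple arithmetic: an exon preserves the reading frame when its length is a multiple of three. The sketch below uses only the lengths and the amino acid total stated in the caption. The function names are hypothetical, and the final step assumes that all 56 added residues are encoded by these three exons.

```python
def is_in_frame(exon_length_bp: int) -> bool:
    """An exon keeps the reading frame if its length is divisible by 3."""
    return exon_length_bp % 3 == 0


def codon_count(exon_length_bp: int) -> int:
    """Number of whole codons contributed by an in-frame exon."""
    if not is_in_frame(exon_length_bp):
        raise ValueError(f"{exon_length_bp} bp is not a multiple of 3")
    return exon_length_bp // 3


# Values taken from the caption.
SECOND_AND_THIRD_EXON_BP = (33, 87)
TOTAL_ADDED_AMINO_ACIDS = 56

# 33 bp -> 11 codons, 87 bp -> 29 codons, 40 in total.
codons_from_in_frame_exons = sum(codon_count(n) for n in SECOND_AND_THIRD_EXON_BP)

# Assumption: the remaining residues come from the first novel exon,
# counted from its start codon onward.
residues_from_first_exon = TOTAL_ADDED_AMINO_ACIDS - codons_from_in_frame_exons
```

Under that assumption, the second and third exons supply 40 of the 56 residues, leaving 16 to be encoded downstream of the start codon in the first novel exon.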