N Bartonicek1,2, M B Clark1,3, X C Quek1,2, J R Torpy1,2, A L Pritchard4, J L V Maag1,2, B S Gloss1,2, J Crawford5, R J Taft5,6, N K Hayward4, G W Montgomery5, J S Mattick1,2, T R Mercer1,2,7, M E Dinger8,9.
Abstract
BACKGROUND: Genotyping of large populations through genome-wide association studies (GWAS) has successfully identified many genomic variants associated with traits or disease risk. Unexpectedly, a large proportion of GWAS single nucleotide polymorphisms (SNPs) and associated haplotype blocks are in intronic and intergenic regions, hindering their functional evaluation. While some of these risk-susceptibility regions encompass cis-regulatory sites, their transcriptional potential has never been systematically explored.
Year: 2017 PMID: 29284497 PMCID: PMC5747244 DOI: 10.1186/s13059-017-1363-3
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Capturing novel transcripts from intronic and intergenic GWAS regions. a Schematic of the experimental design. LD blocks were predicted around GWAS SNPs (colored pins) by identifying proxy (i.e. co-inherited) SNPs (r² > 0.5) from Hapmap23 and 1000 Genomes (white pins). Oligonucleotide probes were designed for 561 intronic and intergenic GWAS regions and hybridized to transcriptomes of 21 target tissues. The captured transcripts were sequenced, assembled, and mapped back to the genome. b Enrichment of captured transcripts. Expression of all captured (red) and non-captured (black) transcripts annotated in GENCODE (v.19) was compared between the testis CaptureSeq sample (y-axis) and testis RNA-seq from Illumina Body Atlas (x-axis). Correlation coefficients are 0.29 for captured transcripts and 0.55 for GENCODE genes. FPKM: fragments per kilobase of transcript per million mapped reads. c Occupancy of 561 intergenic haploblocks by multi-exonic captured transcripts. The majority of haploblocks (84.8%) contain at least one transcript with FPKM > 1. d Counts of captured multi-exonic transcripts with FPKM > 1 across tissues
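The FPKM unit used in panels b to d can be illustrated with a minimal Python sketch. It follows the standard definition spelled out in the caption (fragments per kilobase of transcript per million mapped reads); the function name, argument names, and the example numbers are hypothetical and do not come from the study's pipeline.

```python
def fpkm(fragments: int, transcript_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments per kilobase of transcript per million mapped fragments.

    Illustrative helper only; the names and inputs are assumptions,
    not part of the published analysis.
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def is_expressed(value: float, threshold: float = 1.0) -> bool:
    """Apply the FPKM > 1 cutoff quoted in panels c and d."""
    return value > threshold


# Hypothetical transcript: 500 fragments, 2,000 bp long, 25 million mapped fragments.
example = fpkm(500, 2000, 25_000_000)
print(example, is_expressed(example))
```

Because the value is normalized by both transcript length and library size, the FPKM > 1 cutoff can be applied uniformly across transcripts and tissues.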
Fig. 2 Defining properties of novel transcripts. a Previous observation of portions of captured transcripts in public databases. Percent of captured transcripts overlapping previously annotated transcripts in GENCODE at the time of the experiment design (v.12), GENCODE v.19 and v.27, AceView, MiTranscriptome, and the EST database. Gray shades indicate length overlap between the novel transcript and the previously observed sequences. b Aggregated data for cap analysis gene expression (CAGE) clusters, centered on the 5′ end of captured transcripts. The positive control was defined as lncRNA transcripts from Illumina Body Map data with the same median of the expression distribution across tissues as the captured transcripts. X-axes indicate distance from the 5′ start of transcripts in base pairs. Y-axes represent counts of CAGE clusters, normalized by the number of transcripts (see “Methods”). c Fraction of promoters of captured transcripts, lncRNAs, pseudogenes, and protein-coding genes occupied by CAGE and epigenetic marks: CAGE (blue), H3K4me3 (red), H3K27ac (yellow), H3K4me1 (purple). Hollow circles represent randomized controls, whereby CAGE and epigenetic peaks were randomly distributed across the genome
Fig. 3 Functional properties of captured transcripts. a Comparison of tissue-specific expression of captured transcripts to lncRNAs, pseudogenes, and protein-coding genes (Illumina Body Map), as measured by the Tau tissue specificity index (0 for broadly expressed, 1 for tissue-specific genes) [79]. b Heatmap of tissue-specific captured transcripts (τ > 0.8) across tissues. Unsupervised clustering was performed on τ components (1 − Expression/max(Expression)), colored by tissue specificity from low (white) to high (red). Branches that cluster non-randomly with statistical significance after 10,000 bootstraps, as calculated by Pvclust [115], are marked by red rectangles. **p value of a cluster branch < 10⁻³. c Enrichment of genomic regions of captured transcripts for known FANTOM enhancers. Log odds ratios (ORs) of enrichment (with 95% confidence intervals) compared to lncRNAs, pseudogenes, and protein-coding genes. Genomic regions of both introns and exons were included in the analysis. FANTOM enhancers in red, randomized regions in blue. d Enrichment of GWAS SNPs in transcript regions. Log OR of enrichment for GWAS SNPs (p value < 5 × 10⁻⁸), compared to intronic regions. Exons in red, promoters in yellow, 3′ UTRs in blue. Hollow circles denote enrichment for common SNPs. Statistically significant adjusted p values (χ² test, p values < 0.05) are denoted with asterisks. e Example of a captured transcript with independently validated function. Transcript GCS1669 overlaps three known lncRNAs, with CCAT1 being functionally validated in liver and prostate carcinogenesis. Gray box marks captured region. Previously observed splice sites are denoted in red. f Expression levels of transcript GCS1669 across tissues
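The Tau index in panels a and b can be sketched in a few lines of Python. The sketch sums the per-tissue components given in the caption, 1 − Expression/max(Expression), and divides by the number of tissues minus one, which is the commonly used form of the index. Whether expression values were log-transformed beforehand is not stated here, so the sketch assumes raw non-negative values; the function names and example profiles are hypothetical.

```python
from typing import List, Sequence


def tau_components(expression: Sequence[float]) -> List[float]:
    """Per-tissue components 1 - x_i / max(x), as described for panel b."""
    values = [float(x) for x in expression]
    if any(v < 0 for v in values):
        raise ValueError("expression values must be non-negative")
    peak = max(values)
    if peak == 0:
        # Assumption of this sketch: an unexpressed transcript has no specificity.
        return [0.0 for _ in values]
    return [1.0 - v / peak for v in values]


def tau_index(expression: Sequence[float]) -> float:
    """Tau tissue specificity index: 0 = broadly expressed, 1 = tissue specific."""
    if len(expression) < 2:
        raise ValueError("at least two tissues are required")
    return sum(tau_components(expression)) / (len(expression) - 1)


# Hypothetical expression profiles across four tissues.
print(tau_index([5, 5, 5, 5]))   # uniform expression
print(tau_index([0, 0, 0, 8]))   # expressed in a single tissue
print(tau_index([1, 2, 4, 4]))   # intermediate profile
```

Under this definition a transcript passes the τ > 0.8 filter of panel b only when expression in most tissues is a small fraction of its maximum.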
Examples of captured transcripts with exonic eQTLs. Protein-coding genes whose expression is influenced by eQTLs are characterized by their function and tissue expression in GTEx. In brackets: fold-change overexpression of associated genes in specific tissues compared to their average expression, as given by GTEx, or by the Human Integrated Protein Expression Database (HIPED) [118] in the case of KALRN
| Captured transcript | Highest tissue expression | Haploblock associated phenotype | eQTL | Associated gene | Gene function | Tissue expression |
|---|---|---|---|---|---|---|
| GCS0300 | Cervix | Prostate cancer | rs72928357 | MYEOV | Stimulation of cancer growth and proliferation | Vagina (2.6x) |
| GCS0406 | Heart | HDL cholesterol | rs7134375 | PDE3A | Hypertension, fat metabolism | Heart (19x) |
| GCS0736 | Liver, thyroid | HDL cholesterol | rs11875196 | LIPG | Modulation of HDL cholesterol | Liver (14x), thyroid (78x) |
| GCS1080 | Heart | Mean platelet volume | rs13058993 | KALRN | Activates Rho GTPases to regulate actin cytoskeleton | Platelets (10x, HIPED), heart (2x) |
| GCS1212 | Thyroid | Thyroid function | rs4835532 | Mineralocorticoid receptor (NR3C2) | Regulation of cellular ion concentrations | Thyroid (7.0x) |
| GCS0965 | Testes | Age at first menstruation | rs708984 | PCSK2 | Conversion of proinsulin to insulin | Testis (2x), thyroid (15x) |
| GCS1190 | Kidney | Metabolic traits in urine | rs2348209 | ENPEP | Peptide cleavage | Kidney (11x) |
Fig. 4 Identification of novel transcripts expressed in skin cutaneous melanoma. a Examples of novel transcript types identified through CaptureSeq on 13 melanoma transcriptomes, targeting regions proximal to key melanoma genes. Red lines denote novel splice junctions, red blocks novel exons, and gray boxes the captured regions. From top to bottom: a fusion protein between GRM5 and NOX4, novel exons in ACAT1, novel lncRNAs bidirectional to ENSA, and an antisense lncRNA overlapping ADAMTSL4. b, c Violin plots of fold change of novel transcripts in primary tumor and metastatic samples vs normal. Red dots denote significant differences (FDR < 0.01). Out of 31 novel transcripts, 22 were detectable in both TCGA primary tumors and metastatic samples. d Kaplan–Meier survival curve for captured transcript GCSM011 in metastatic melanomas. Red lines mark the group in the upper half of transcript expression and blue lines the group in the lower half. e Schematic representation of the genomic locus of GCSM011 between MCL1 and ENSA. f Allelic imbalance of captured transcripts in haploblocks associated with melanoma. Heterozygous sites were predicted with QuASAR [116] and allelic imbalance calculated with MBASED [117]. Y-axis represents median allelic expression across heterozygous SNPs. Allelic imbalance is displayed as the absolute value of (0.5 - allelic imbalance). Homozygous SNPs, heterozygous SNPs with allelic imbalance, and heterozygous SNPs without allelic imbalance are shown in blue, red, and yellow, respectively. At least 30 reads had to be observed over a SNP, with a significance cutoff of FDR < 0.1
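The quantity plotted in panel f can be illustrated with a short Python sketch. It is not the QuASAR or MBASED procedure named in the caption. It only shows, under simplifying assumptions, how a per-SNP allelic ratio could be turned into the displayed distance from 0.5, with the 30-read minimum applied. Treating "allelic imbalance" as the reference-allele read fraction is an assumption of this sketch, and all names and read counts are hypothetical.

```python
from statistics import median
from typing import List, Optional, Tuple


def displayed_imbalance(ref_reads: int, alt_reads: int, min_reads: int = 30) -> Optional[float]:
    """Distance of the reference-allele fraction from 0.5 for one heterozygous SNP.

    Returns None when coverage is below the minimum read count.
    Hypothetical helper; no statistical testing is performed here.
    """
    total = ref_reads + alt_reads
    if total < min_reads:
        return None
    ratio = ref_reads / total
    return abs(0.5 - ratio)


def median_imbalance(snps: List[Tuple[int, int]], min_reads: int = 30) -> Optional[float]:
    """Median displayed imbalance over the SNPs of a transcript that pass coverage."""
    kept = [v for v in (displayed_imbalance(r, a, min_reads) for r, a in snps) if v is not None]
    return median(kept) if kept else None


# Hypothetical (reference, alternate) read counts at three heterozygous SNPs.
snps = [(45, 15), (30, 30), (10, 10)]
print([displayed_imbalance(r, a) for r, a in snps])
print(median_imbalance(snps))
```

A balanced SNP gives 0 and fully monoallelic expression gives 0.5. Calling an imbalance significant would additionally require a statistical test with multiple-testing correction, which this sketch omits.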