Halit Ongen, Emmanouil T Dermitzakis.
Abstract
With the advent of RNA-sequencing technology, we can detect different types of alternative splicing and determine how DNA variation regulates splicing. However, given the short read lengths used in most population-based RNA-sequencing experiments, quantifying transcripts accurately remains a challenge. Here we present a method, Altrans, for discovery of alternative splicing quantitative trait loci (asQTLs). To assess the performance of Altrans, we compared it to Cufflinks and MISO in simulations and Cufflinks for asQTL discovery. Simulations show that in the presence of unannotated transcripts, Altrans performs better in quantifications than Cufflinks and MISO. We have applied Altrans and Cufflinks to the Geuvadis dataset, which comprises samples from European and African populations, and discovered (FDR = 1%) 1,427 and 166 asQTLs with Altrans and 1,737 and 304 asQTLs with Cufflinks for Europeans and Africans, respectively. We show that, by discovering a set of asQTLs in a smaller subset of European samples and replicating these in the remaining larger subset of Europeans, both methods achieve similar replication levels (95% for both methods). We find many Altrans-specific asQTLs, which replicate to a high degree (93%). This is mainly due to junctions absent from the annotations and hence not tested with Cufflinks. The asQTLs are significantly enriched for biochemically active regions of the genome, functional marks, and variants in splicing regions, highlighting their biological relevance. We present an approach for discovering asQTLs that is a more direct assessment of splicing compared to other methods and is complementary to other transcript quantification methods.
Year: 2015 PMID: 26430802 PMCID: PMC4596912 DOI: 10.1016/j.ajhg.2015.09.004
Source DB: PubMed Journal: Am J Hum Genet ISSN: 0002-9297 Impact factor: 11.025
Figure 1. Schematic of the Altrans Algorithm
(A) Overlapping exons are grouped into exon groups where identical exons belonging to multiple transcripts are treated as one unique entity. Two transcripts, shown as connected brown and green boxes, result in two exon groups and three exons shown as blue boxes. Next, the unique regions of each exon, depicted as light blue boxes and a subscript u followed by the level of the exon, are identified. Because E2 has a region that is not shared by any other exon, it is assigned a “level” of 1, and the reads aligning to E2u,1 can be unambiguously assigned to E2. E1 does not have a unique portion, and therefore the level 1 exon, E2, is removed from the exon group and the whole of E1 becomes a unique portion, shown as an empty blue box, with a level of 2. These unique regions are used when assigning mate pairs to links as shown with the red lines where the solid portions of the line are the sequenced mates and the dashed part represents the inferred insert.
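The level assignment in (A) can be sketched as follows. This is a minimal illustration, not the Altrans implementation: exons are assumed to be half-open `(start, end)` intervals within one exon group, the coordinates are hypothetical, and the function names `unique_regions` and `assign_levels` are invented for this sketch.

```python
def unique_regions(exon, others):
    """Return the sub-intervals of `exon` not covered by any interval in `others`."""
    start, end = exon
    regions = []
    cursor = start
    for o_start, o_end in sorted(others):
        if o_end <= cursor or o_start >= end:
            continue  # no overlap with the part of the exon still being scanned
        if o_start > cursor:
            regions.append((cursor, o_start))
        cursor = max(cursor, o_end)
        if cursor >= end:
            break
    if cursor < end:
        regions.append((cursor, end))
    return regions


def assign_levels(group):
    """Map each exon in an exon group to (level, unique regions).

    Exons with a unique portion get the current level and are removed;
    the remaining exons are then re-examined at the next level.
    """
    remaining = set(group)
    result = {}
    level = 1
    while remaining:
        resolved = {}
        for exon in remaining:
            regions = unique_regions(exon, remaining - {exon})
            if regions:
                resolved[exon] = regions
        if not resolved:
            break  # exons that cannot be separated are left unresolved in this sketch
        for exon, regions in resolved.items():
            result[exon] = (level, regions)
        remaining.difference_update(resolved)
        level += 1
    return result


# Hypothetical coordinates: E1 lies entirely inside E2, as in the caption.
E1, E2 = (100, 200), (100, 300)
levels = assign_levels([E1, E2])
```

With these coordinates, E2 receives level 1 with unique region `(200, 300)`, and E1 receives level 2 with its whole length as the unique portion once E2 is removed.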
(B) The default method for calculating link coverage. Link coverage is necessary to normalize the observed counts for the length of the unique portions being linked and the insert size. The theoretical minimum and maximum insert sizes linking the two unique portions, represented as brown and green lines, respectively, are calculated. Given the empirically determined insert size distribution, the area under the curve between the minimum and maximum insert sizes is then estimated. The link coverage equals the number of mate pairs linking the two unique portions divided by the ratio of this area to the area of the whole insert size distribution.
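The normalization in (B) can be illustrated with a short sketch. It assumes the empirical insert size distribution is available as a histogram (insert size to count), so the "area under the curve" becomes a sum of counts; the function name and the numbers are hypothetical.

```python
def link_coverage(n_mate_pairs, insert_hist, min_insert, max_insert):
    """Normalize a mate-pair count by the fraction of the insert size
    distribution that could link the two unique portions."""
    total_area = sum(insert_hist.values())
    linking_area = sum(
        count for size, count in insert_hist.items()
        if min_insert <= size <= max_insert
    )
    fraction = linking_area / total_area
    if fraction == 0:
        raise ValueError("no insert size in the distribution can link these regions")
    return n_mate_pairs / fraction


# Hypothetical histogram: half of all inserts fall between 150 and 250 bp.
hist = {100: 25, 200: 50, 300: 25}
coverage = link_coverage(10, hist, min_insert=150, max_insert=250)
```

Here 10 observed mate pairs are divided by a fraction of 0.5, so links that only a narrow range of insert sizes can span are scaled up relative to their raw counts.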
(C) The degrees of freedom method for determining link coverage. Here given a read length and insert size of 3 and two exons that are 6 and 5 bases long, there are three mate pair alignments that can link these two exons. Therefore, the degrees of freedom refer to the theoretical number of positions where a mate pair (given, in this case, 3+3+3 = 9 bp long fragment size) exists that links these exons on the mRNA, shown as black lines. The link coverage is the number of mate pairs linking the exons over the degrees of freedom.
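The count in (C) can be reproduced by enumerating fragment start positions. This sketch assumes the two exons are directly adjacent on the mRNA and that each mate must fall entirely within its exon; the function name is hypothetical.

```python
def degrees_of_freedom(len_a, len_b, read_len, insert):
    """Count fragment positions whose first mate lies in exon A and
    whose second mate lies in exon B (exons adjacent on the mRNA)."""
    fragment = 2 * read_len + insert
    count = 0
    for start in range(len_a + len_b - fragment + 1):
        mate1_end = start + read_len            # exclusive end of first mate
        mate2_start = start + read_len + insert
        mate2_end = mate2_start + read_len      # exclusive end of second mate
        if mate1_end <= len_a and mate2_start >= len_a and mate2_end <= len_a + len_b:
            count += 1
    return count


def link_coverage_dof(n_mate_pairs, len_a, len_b, read_len, insert):
    """Link coverage as mate-pair count over the degrees of freedom."""
    return n_mate_pairs / degrees_of_freedom(len_a, len_b, read_len, insert)


# The caption's example: read length 3, insert size 3, exons of 6 and 5 bases.
dof = degrees_of_freedom(6, 5, read_len=3, insert=3)
```

For the caption's values, the 9 bp fragment fits at three start positions on the 11-base mRNA, and all three place one mate in each exon.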
(D) The equation to calculate F value for a link.
(E) A worked example of calculation of the F values. First the coverage of E2 to E3 link (CE2 − E3) is determined from level 1 unique regions (CE2u,1 − E3u,1), which is then subtracted from the coverage attained from the pseudo-unique E1 to E3 link (CE1u,2 − E3u,1) in order to calculate the true E1 − E3 coverage (CE1 − E3). In the forward direction, E1 and E2 become primary exons and in the reverse direction E3 is the primary exon and the corresponding F values are calculated as shown.
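The worked example in (E) can be sketched numerically. The caption does not reproduce the equation from (D), so this sketch assumes the F value of a link is its coverage divided by the summed coverage of all links made by the same primary exon; that definition, the function name, and all coverage values are assumptions for illustration.

```python
def f_values(link_coverages):
    """Given {(primary_exon, partner_exon): coverage}, return each link's
    coverage as a fraction of all coverage from the same primary exon."""
    totals = {}
    for (primary, _partner), cov in link_coverages.items():
        totals[primary] = totals.get(primary, 0.0) + cov
    return {
        link: cov / totals[link[0]]
        for link, cov in link_coverages.items()
        if totals[link[0]] > 0
    }


# Hypothetical coverages.
c_e2_e3 = 30.0        # from the level 1 unique regions E2u,1 and E3u,1
c_e1u2_e3u1 = 50.0    # pseudo-unique E1u,2 to E3u,1 link, which also contains E2-E3 pairs
c_e1_e3 = c_e1u2_e3u1 - c_e2_e3   # true E1-E3 coverage after subtraction

# Forward direction: E1 and E2 are the primary exons.
forward = f_values({("E1", "E3"): c_e1_e3, ("E2", "E3"): c_e2_e3})
# Reverse direction: E3 is the primary exon.
reverse = f_values({("E3", "E1"): c_e1_e3, ("E3", "E2"): c_e2_e3})
```

Under these assumptions each forward link has an F value of 1 because each primary exon makes a single link, whereas in the reverse direction E3 splits its coverage 0.4 to 0.6 between E1 and E2.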
Figure 2. Simulation Results
Using Flux Simulator, we ran six simulations with varying levels of unannotated transcripts. Subsequently, we ran quantifications with three methods using the known GENCODE v.12 annotation. We compared the simulated versus measured link quantifications via Spearman’s rank correlation. These comparisons are shown as colored solid lines. To produce a null random distribution for each method, we took the link quantifications for each gene, permuted them 100 times within the links of that gene, and measured the correlation of these random assignments with the simulated ones. By using this sampling method stratified by genes, we account for the variability in the number of isoforms per gene. These correlations for random assignments are shown as dashed lines. We observe that as the percentage of novel transcripts increases, the performance of Cufflinks and MISO suffers, whereas this is not the case for Altrans, which yields the best quantifications at increased levels of unannotated transcripts.
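The gene-stratified permutation null described above can be sketched in plain Python. This is an illustration under stated assumptions: quantifications are stored per gene as lists in matching link order, the correlation is computed over all links pooled across genes, and the function names are hypothetical.

```python
import random
from statistics import mean


def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def permuted_null(simulated, measured, n_perm=100, seed=0):
    """Shuffle measured link values within each gene and correlate
    the shuffled values with the simulated ones, n_perm times."""
    rng = random.Random(seed)
    sim_flat = [v for gene in simulated for v in simulated[gene]]
    null = []
    for _ in range(n_perm):
        shuffled = []
        for gene in simulated:
            values = list(measured[gene])
            rng.shuffle(values)  # permutation never crosses gene boundaries
            shuffled.extend(values)
        null.append(spearman(sim_flat, shuffled))
    return null


# Hypothetical quantifications for two genes with different numbers of links.
simulated = {"geneA": [1.0, 2.0, 3.0], "geneB": [10.0, 20.0]}
measured = {"geneA": [1.1, 2.2, 2.9], "geneB": [9.0, 21.0]}
observed = spearman(
    [v for g in simulated for v in simulated[g]],
    [v for g in simulated for v in measured[g]],
)
null = permuted_null(simulated, measured)
```

Shuffling only within a gene keeps each gene's number of links and its overall expression level fixed, so the null reflects random assignment among isoform links and not differences between genes.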
Number of Genes Tested and asQTLs Discovered at FDR = 1% in Each Population and by Each Method
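| Population | Genes tested (Altrans) | asQTLs (Altrans) | Genes tested (Cufflinks) | asQTLs (Cufflinks) | Overlap | p value |
| --- | --- | --- | --- | --- | --- | --- |
| EUR | 7,443 | 1,427 | 7,148 | 1,737 | 780 | 1.3 × 10⁻⁴ |
| YRI | 7,720 | 166 | 7,391 | 304 | 76 | 1.2 × 10⁻⁴ |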
The overlap column lists the genes common to both methods, and the p value is the probability of this overlap arising by chance.
Figure 3. asQTL Discovery
(A) The relative distance of asQTLs to the TSS versus the p value.
(B) Mosaic plots of gene level sharing of asQTLs for each method at FDR = 1%.
(C) The p value distributions of a variant-link pair tested in the other population for each method. From these p value distributions, the π1 statistic, which estimates the proportion of true positives, is calculated.
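As a rough illustration of how a π1 estimate is obtained from a p value distribution, the sketch below uses Storey's estimator of the null proportion π0 at a single fixed threshold λ. The choice of λ = 0.5 and the function name are assumptions; the caption does not state how π1 was computed here, and implementations such as the qvalue package typically smooth the estimate over a range of λ values.

```python
def pi1(pvalues, lam=0.5):
    """Estimate the proportion of true positives as 1 - pi0, where pi0 is
    the share of p values above `lam`, rescaled by the width (1 - lam)."""
    m = len(pvalues)
    pi0 = sum(p > lam for p in pvalues) / (m * (1 - lam))
    pi0 = min(pi0, 1.0)  # a proportion cannot exceed 1
    return 1.0 - pi0


# Hypothetical replication p values: 80 strong signals and 20 spread above 0.5.
pvals = [0.001] * 80 + [0.6, 0.7, 0.8, 0.9] * 5
estimate = pi1(pvals)
```

The reasoning is that null p values are uniform, so the density above λ estimates the null share; here 20 of 100 p values exceed 0.5, giving π0 = 0.4 and π1 = 0.6.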
Figure 4. Functional Enrichments of asQTLs Discovered by Altrans and Cufflinks
All variants identified in separate populations are merged. The null (frequency and distance matched) is represented as the black horizontal line. The numbers above each bar are the −log10 p values of the enrichment, with the Altrans p value listed first, followed by the Cufflinks p value.