| Literature DB >> 19417075 |
Abstract
The complexity of mammalian transcriptomes is compounded by alternative splicing which allows one gene to produce multiple transcript isoforms. However, transcriptome comparison has been limited to differential analysis at the gene level instead of the individual transcript isoform level. High-throughput sequencing technologies and high-resolution tiling arrays provide an unprecedented opportunity to compare transcriptomes at the level of individual splice variants. However, sequence read coverage or probe intensity at each position may represent a family of splice variants instead of one single isoform. Here we propose a hierarchical Bayesian model, BASIS (Bayesian Analysis of Splicing IsoformS), to infer the differential expression level of each transcript isoform in response to two conditions. A latent variable was introduced to perform direct statistical selection of differentially expressed isoforms. Model parameters were inferred based on an ergodic Markov chain generated by our Gibbs sampler. BASIS has the ability to borrow information across different probes (or positions) from the same genes and different genes. BASIS can handle the heteroskedasticity of probe intensity or sequence read coverage. We applied BASIS to a human tiling-array data set and a mouse RNA-seq data set. Some of the predictions were validated by quantitative real-time RT-PCR experiments.Entities:
Mesh:
Substances:
Year: 2009 PMID: 19417075 PMCID: PMC2691848 DOI: 10.1093/nar/gkp282
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1.Hierarchical structure of BASIS. The observed data are denoted as rectangles. The random variables besides Y are denoted as ovals. The hyperparameters are listed in brackets. A solid arrow indicates a stochastic dependence while a dashed arrow indicates a logical function. The details of BASIS can be found in Materials and Methods section.
Figure 2.Workflow of the prescreening steps and BASIS results.
Figure 3.Heteroskedasticity of probe intensity and sequence read coverage. (A and B) Multiplicative error of probe intensity (A) and sequence read coverage (B). The x-axis represents the mean probe intensity across three replicates (A) or the mean sequence read coverage across two replicates (B). The y-axis represents the corresponding standard deviation of probe intensity or sequence read coverage. In (A), 500 000 normalized probe intensity data obtained from the human HeLa tiling array were used. In (B), normalized sequence read coverage at 500 000 nucleotide positions from the mouse liver RNA-seq data were used. x and y are plotted in a log2 scale for visual convenience. (C and D) Histograms demonstrating the number of different bins in a given gene for the tiling array data (C) and the RNA-seq data (D). Probes or positions were divided into 100 bins according to their intensity or sequence read coverage. For each gene, we counted the number of different bins that its probes or positions belong to. The number of genes with a specific number of bins was shown in the histograms.
Performance of BASIS and the least squares fit
| BASIS | Least squares fit | |||
|---|---|---|---|---|
| Power | False-positive rate | Power | False-positive rate | |
| Total | 0.76 | 0.005 | 0.31 | 0.005 |
| Gene 1 | 0.74 | 0.002 | 0.16 | 0.002 |
| Gene 2 | 0.88 | 0.002 | 0.29 | 0.005 |
| Gene 3 | 0.96 | 0.0003 | 0.43 | 0.004 |
| Gene 4 | 0.65 | 0.001 | 0.33 | 0.006 |
| Gene 5 | 0.56 | 0.001 | 0.35 | 0.007 |
| Gene 6 | 0.59 | 0.001 | 0.38 | 0.008 |
| Gene 7 | 0.83 | 0.1 | 0.27 | 0.02 |
| Gene 8 | 0.89 | 0.3 | 0.28 | 0.08 |
| Gene 9 | 0.72 | 0.4 | 0.29 | 0.2 |
The total false-positive rate was controlled at 0.005. The power and the false–positive rate were the average values across 1000 simulations. They were also calculated for genes 1–9 separately. For genes 1–3, the annotations for all five isoforms are known and Δβ1 = (−1.8,1.8,0,0,0), Δβ2 = (−1.8,2.4,0,0,0), and Δβ3 = (−2.4,2.4,0,0,0). For genes 4–6, the annotations of isoform 5 are missing but isoform 5 is differentially expressed together with isoform 1 and isoform 2:Δβ4 = (−1.8,2.4,0,0,1.2), Δβ5 = (−1.8,2.4,0,0,1.8), and Δβ6 = (−1.8,2.4,0,0,2.4) The correlations between the exon arrangements of isoform 5 and those of isoforms 1, 2, 3 and 4 are −0.2, −0.2, −0.2 and −0.32. For genes 7–9, the annotations of isoform 4 are missing. But isoform 4 is differentially expressed together with isoform 1 and isoform 2: Δβ7 = (−1.8,2.4,0,1.2,0), Δβ8 = (−1.8,2.4,0,1.8,0), and Δβ9 = (−1.8,2.4,0,2.4,0). The correlations between isoform 4 and isoforms 1, 2, 3 and 5 are −0.32, −0.32, 0.63 and −0.32.
Figure 4.Power of BASIS and the least squares fit for 100 different matrix Es. The powers were calculated based on 1000 simulations on 100 genes. The total false-positive rate was controlled at 0.005.
Figure 5.Differentially expressed transcript isoforms predicted by isoform-specific junction reads and BASIS. For each comparison (A versus B), we declared a transcript isoform as up-regulated in A or B if the junction read difference A – B > 4 or B – A > 4. The junction read has to be isoform-specific to the transcript. Grey areas represent the proportions of transcripts declared as differentially expressed by both junction read difference and BASIS. White areas represent the proportions of transcripts predicted as differentially expressed by junction read difference but not by BASIS.
Figure 6.Experimental validation of BASIS prediction. Real time RT–PCR barplots of tested transcripts’ relative expression levels between mouse brain and liver (A), between mouse brain and muscle (B), and between HeLa and HepG2 cells (C). Relative expression ratio (condition 1/condition 2) = 1 means no differential expression between two conditions. Relative expression ratio >1 means higher expression in condition 1. Relative expression ratio <1 means higher expression in condition 2. Black bars are transcripts predicted to have higher expression levels in condition 1 by BASIS and white bars are transcripts predicted to have higher expression levels in condition 2. Gray bars are those predicted not to be differentially expressed between two conditions. Value represents mean ± SEM, N = 3.
Figure 7.Agarose gel electrophoresis of RT–PCR products. (A). Tested transcripts were separated on agarose gels in lanes denoted by B (brain), L (liver), M (muscle) and −(RT negative control). One and two are two different isoforms whose transcript IDs can be referred in (C). Gapdh, mRps18a and Sdha are internal control genes. (B) Tested transcripts were separated on agarose gels in lanes denoted by H (HeLa cell), He (HepG2 cell) or −(RT negative control). HRPT1, RPLP0 and SDHA are internal control genes. The length of each product is identical to the PCR target region. Arrow points to the brightest DNA size marker 300 bp. Below it, they are 200 bp, 100 bp and 50 bp. Above it, they are 400 bp, 500 bp, 600 bp, 700 bp, 800 bp and 1 kb.