Mark F Rogers, Julie Thomas, Anireddy Sn Reddy, Asa Ben-Hur.
Abstract
We propose a method for predicting splice graphs that enhances curated gene models using evidence from RNA-Seq and EST alignments. Results obtained using RNA-Seq experiments in Arabidopsis thaliana show that predictions made by our SpliceGrapher method are more consistent with current gene models than predictions made by TAU and Cufflinks. Furthermore, analysis of plant and human data indicates that the machine learning approach used by SpliceGrapher is useful for discriminating between real and spurious splice sites, and can improve the reliability of detection of alternative splicing. SpliceGrapher is available for download at http://SpliceGrapher.sf.net.
Year: 2012 PMID: 22293517 PMCID: PMC3334585 DOI: 10.1186/gb-2012-13-1-r4
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. Example of a predicted splice graph in A. thaliana. RNA-Seq alignment data were loaded along with gene model annotations to create a composite model that incorporates all available evidence. SpliceGrapher's visualization modules produce color-coded graphs, based on the color scheme used by Sircah [29], that make it easy to see the exons and introns involved in AS events. RNA-Seq read coverage across one of the introns was sufficient to allow SpliceGrapher to identify an intron retention event (exon outlined in blue). In addition, a novel splice junction (highlighted in green) provided SpliceGrapher with evidence for an alternative 3' splicing event (highlighted in orange). The number associated with each splice junction is the number of reads that align across it. Vertical bands in the background depict exon boundaries in the original gene model.
Figure 2. Splice graph prediction pipeline. SpliceGrapher predicts splice graphs using information from gene models, EST alignments and RNA-Seq data. RNA-Seq exonic alignments may be performed using any popular short-read alignment tool. RNA-Seq spliced alignments may be performed using a conventional short-read mapping tool with a database of splice junctions predicted by SpliceGrapher, or they may be performed using short-read spliced-alignment programs such as TopHat, followed by filtering using SpliceGrapher's database of predicted splice sites. SpliceGrapher incorporates all of this information to produce a comprehensive splice graph prediction.
Alternative splicing events
| | AS genes | IR | ES | Alt. 5' | Alt. 3' | Total AS events |
|---|---|---|---|---|---|---|
| | 4,029 | 1,987 (33%) | 550 (9%) | 1,256 (21%) | 2,145 (36%) | 5,938 |
| | ||||||
| No ESTs | 4,901 | 2,248 (30%) | 714 (10%) | 1,560 (21%) | 2,866 (39%) | 7,388 |
| Novel | 885 | 308 (21%) | 164 (11%) | 304 (20%) | 721 (48%) | 1,497 |
| With ESTs | 6,162 | 3,658 (33%) | 994 (9%) | 2,335 (21%) | 4,128 (37%) | 9,916 |
| Novel | 2,154 | 1,779 (34%) | 444 (8%) | 1,079 (20%) | 1,983 (38%) | 5,285 |
| | ||||||
| No gene models | 1,263 | 449 (32%) | 383 (28%) | 237 (17%) | 319 (23%) | 1,388 |
| Novel | 699 | 429 (32%) | 380 (28%) | 232 (17%) | 304 (23%) | 1,345 |
| With gene models | 6,056 | 4,029 (39%) | 2,857 (27%) | 1,427 (14%) | 2,106 (20%) | 10,419 |
| Novel | 2,319 | 2,232 (38%) | 2,550 (43%) | 552 (9%) | 594 (10%) | 5,928 |
| | ||||||
| No gene models | 2,777 | 893 (17%) | 475 (9%) | 1,481 (27%) | 2,555 (47%) | 5,404 |
| Novel | 1,591 | 811 (16%) | 460 (9%) | 1,431 (28%) | 2,351 (47%) | 5,053 |
| With gene models | 10,458 | 94,571 (85%) | 598 (1%) | 5,972 (5%) | 9,820 (9%) | 110,961 |
| Novel | 8,364 | 94,124 (86%) | 476 (0%) | 5,697 (5%) | 9,219 (8%) | 109,516 |
| | 0 | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 |
| | 2,039 | 347 (13%) | 830 (31%) | 640 (24%) | 838 (32%) | 2,655 |
| | ||||||
| No gene models | 3,099 | 531 (10%) | 684 (13%) | 1,321 (25%) | 2,743 (52%) | 5,279 |
| With gene models | 15,874 | 135,585 (72%) | 4,938 (3%) | 23,615 (13%) | 24,406 (13%) | 188,544 |
| | ||||||
| No gene models | 1,057 | 324 (24%) | 519 (39%) | 140 (11%) | 349 (26%) | 1,332 |
| With gene models | 4,263 | 4,120 (34%) | 3,148 (26%) | 2,165 (18%) | 2,818 (23%) | 12,251 |
The number of AS events detected by SpliceGrapher, Cufflinks, and TAU compared with events inferred from the TAIR9 annotations. We track the following AS event types: intron retention (IR), exon skipping (ES), alternative 5' sites (Alt. 5') and alternative 3' sites (Alt. 3'). SpliceGrapher uses the TAIR9 gene models as a baseline, so it includes all of the same AS events along with additional events inferred from RNA-Seq data. Without gene models, nearly all TAU and Cufflinks predictions are novel AS events. With gene models, more than half of Cufflinks predictions reproduce AS events from the gene models. TAU uses known splice sites to predict all possible exons in a gene, generating vast numbers of novel exons and novel IR events.
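The percentages in the table above are each event type's share of the AS events in that row. A minimal sketch of that arithmetic follows, using the counts from the "No ESTs" row as input; the function name `event_shares` is ours and is not part of any of the packages discussed.

```python
def event_shares(counts):
    """Return each AS event type's share of the row total as a whole percentage."""
    total = sum(counts.values())
    if total == 0:
        # A row with no events has no meaningful shares; report zeros.
        return {event: 0 for event in counts}
    return {event: round(100 * n / total) for event, n in counts.items()}


# Counts copied from the "No ESTs" row of the table.
no_ests = {"IR": 2248, "ES": 714, "Alt. 5'": 1560, "Alt. 3'": 2866}
shares = event_shares(no_ests)

for event, pct in shares.items():
    print(f"{event}: {no_ests[event]:,} ({pct}%)")
```

Rounding each share independently means the four percentages in a row need not sum to exactly 100.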
Figure 3. Ambiguities in RNA-Seq data. Read coverage that spans several introns makes isoform prediction challenging: SpliceGrapher cannot determine whether the coverage results from a single intron retention event or from several independent events.
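The ambiguity can be illustrated with a toy coverage check. The sketch below is not SpliceGrapher's criterion: it assumes, purely for illustration, that an intron counts as covered when every base has at least `min_depth` reads, and the depths and coordinates are made up.

```python
def fully_covered_introns(depth, introns, min_depth=1):
    """Return introns whose every base has at least min_depth reads.

    depth   -- list of per-base read depths, indexed by 0-based position
    introns -- list of (start, end) half-open intervals into depth
    """
    covered = []
    for start, end in introns:
        window = depth[start:end]
        # Skip empty intervals so they are never reported as covered.
        if window and all(d >= min_depth for d in window):
            covered.append((start, end))
    return covered


# Hypothetical gene: four exons separated by three 5-base introns.
depth = [5] * 10 + [2] * 5 + [6] * 10 + [3] * 5 + [4] * 10 + [0] * 5 + [7] * 10
introns = [(10, 15), (25, 30), (40, 45)]

covered = fully_covered_introns(depth, introns)
print(covered)
```

Here the first two introns are covered. Coverage alone cannot say whether one isoform retains both introns or two isoforms each retain one, which is the ambiguity the figure describes.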
Recall of TAIR9 and TAIR10 annotations
| Method | TAIR9 recall (exons) | TAIR9 recall (introns) | Novel exons | Novel introns | TAIR10 transcripts (number) | TAIR10 transcripts (percentage) | TAIR10 recall (exons) | TAIR10 recall (introns) |
|---|---|---|---|---|---|---|---|---|
| | 1.00 | 1.00 | 1,428 | 1,282 | 28 | 1.4% | 0.039 | 0.045 |
| | 1.00 | 1.00 | 11,299 | 3,557 | 38 | 1.9% | 0.050 | 0.056 |
| | ||||||||
| No gene models | 0.25 | 0.19 | 33,252 | 3,425 | 0 | 0.0% | 0.035 | 0.017 |
| With gene models | 0.94 | 0.89 | 12,690 | 5,222 | 4 | 0.2% | 0.017 | 0.008 |
| | ||||||||
| No gene models | 0.21 | 0.29 | 86,346 | 5,335 | 0 | 0.0% | 0.043 | 0.029 |
| With gene models | 0.69 | 0.79 | 115,130 | 3,734 | 11 | 1.1% | 0.079 | 0.074 |
Comparison of the ability of SpliceGrapher, Cufflinks, and TAU to predict TAIR9 and TAIR10 annotations in A. thaliana. The columns of the table provide recall levels of TAIR9 annotations, the number of novel exons and introns that are predicted, i.e., are not in the TAIR9 annotations, the number and percentage of TAIR10 transcripts that are predicted, and the recall level of TAIR10 annotations at the exon and intron level. When TAU and Cufflinks rely on RNA-Seq data alone they tend to produce graphs that are missing many of the features found in the gene models, as reflected in their recall scores between 0.19 and 0.29. When we provide them with TAIR9 annotations their recall scores improve, though TAU's improved statistics result from a vast number of novel exons, many of which may be spurious. When comparing these predictions with novel splice forms in the TAIR10 gene models, SpliceGrapher predicts more novel splice forms correctly than the other packages.
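Recall and novel-feature counts of this kind reduce to set operations once exons or introns are represented as comparable keys. The sketch below assumes a feature is identified by an exact `(chromosome, strand, start, end)` tuple; that matching rule, the function name, and the coordinates are illustrative assumptions rather than details taken from the study.

```python
def recall_and_novel(predicted, annotated):
    """Compare predicted features with an annotation.

    Returns (recall, novel), where recall is the fraction of annotated
    features that were predicted and novel is the set of predicted
    features absent from the annotation.
    """
    predicted, annotated = set(predicted), set(annotated)
    recovered = predicted & annotated
    recall = len(recovered) / len(annotated) if annotated else 0.0
    novel = predicted - annotated
    return recall, novel


# Hypothetical exons keyed by (chromosome, strand, start, end).
annotated_exons = {
    ("chr1", "+", 100, 200),
    ("chr1", "+", 300, 400),
    ("chr1", "+", 500, 600),
    ("chr1", "+", 700, 800),
}
predicted_exons = {
    ("chr1", "+", 100, 200),
    ("chr1", "+", 300, 400),
    ("chr1", "+", 300, 450),  # alternative 3' end, not in the annotation
}

recall, novel = recall_and_novel(predicted_exons, annotated_exons)
print(f"recall = {recall:.2f}, novel exons = {len(novel)}")
```

A method that starts from the annotation and only adds features has recall 1.00 by construction, which is why recall is most informative when read together with the number of novel features.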
Figure 4. Example of a Cufflinks prediction. We provide the predictions made by Cufflinks for the same gene whose SpliceGrapher predictions are shown in Figure 1. Some of the splice junctions used by Cufflinks are predicted to be false positives by SpliceGrapher's accurate splice junction classifiers (red edges in the plot). These lead to detection of questionable AS events.
Comparison of splice junctions identified by each package
| | SpliceGrapher | Supersplat | TopHat |
|---|---|---|---|
| A. thaliana | | | |
| Canonical junctions (GT-AG/GC-AG) within genes | 80,421 | 84,744 | 83,367 |
| Junctions in common | - | 74,821 | 63,710 |
| Novel junctions | 4,969 | 7,255 | 14,572 |
| Novel junctions with a false-positive site | - | 3,077 | 9,942 |
| Novel junctions in common | - | 3,599 | 1,982 |
| V. vinifera | | | |
| Canonical junctions (GT-AG/GC-AG) within genes | 74,457 | 82,281 | 65,439 |
| Junctions in common | - | 70,554 | 59,154 |
| Novel junctions | 9,831 | 13,394 | 6,307 |
| Novel junctions with a false-positive site | - | 4,040 | 1,899 |
| Novel junctions in common | - | 8,020 | 3,662 |
Side-by-side comparison of canonical splice junctions identified by each package, reconciling differences between SpliceGrapher and the other two packages for A. thaliana (top half) and V. vinifera (bottom half). For each species we show the number of canonical splice junctions each package found recapitulated in the RNA-Seq reads. For Supersplat and TopHat we also show the number of junctions each shares with SpliceGrapher and the number of novel junctions for which SpliceGrapher's classifiers identified either the donor site or the acceptor site as a false positive.
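The rows of this comparison can be sketched as set operations over junction keys. In the sketch below a junction is a hypothetical `(chromosome, strand, donor, acceptor)` tuple and `rejected_sites` stands in for the sites a classifier labelled as false positives; none of these names or data structures come from SpliceGrapher, Supersplat, or TopHat.

```python
CANONICAL = {("GT", "AG"), ("GC", "AG")}


def is_canonical(donor_dinuc, acceptor_dinuc):
    """True for GT-AG or GC-AG intron boundary dinucleotides."""
    return (donor_dinuc.upper(), acceptor_dinuc.upper()) in CANONICAL


def compare_junctions(reference, other, annotated, rejected_sites):
    """Count shared and novel junctions of `other` relative to `reference`.

    Junctions are (chrom, strand, donor, acceptor) tuples; rejected_sites
    holds (chrom, strand, position) for sites flagged as false positives.
    """
    novel = other - annotated
    novel_fp = {
        (chrom, strand, donor, acceptor)
        for chrom, strand, donor, acceptor in novel
        if (chrom, strand, donor) in rejected_sites
        or (chrom, strand, acceptor) in rejected_sites
    }
    return {
        "common": len(reference & other),
        "novel": len(novel),
        "novel_with_fp_site": len(novel_fp),
        "novel_common": len(novel & reference),
    }


# Hypothetical data for a single gene.
annotated = {("chr1", "+", 100, 200), ("chr1", "+", 300, 400)}
reference = annotated | {("chr1", "+", 300, 450)}
other = {
    ("chr1", "+", 100, 200),
    ("chr1", "+", 300, 450),
    ("chr1", "+", 500, 600),
    ("chr1", "+", 700, 800),
}
rejected_sites = {("chr1", "+", 500)}

summary = compare_junctions(reference, other, annotated, rejected_sites)
print(summary)
```

A junction is flagged when either its donor or its acceptor site was rejected, matching the "either the donor site or the acceptor site" wording of the caption.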
Recall of splice junctions identified from EST data
| Species | SpliceGrapher | Supersplat | TopHat |
|---|---|---|---|
| | 24,757 | 25,590 | 22,934 |
| | 35,403 | 37,291 | 34,081 |
We compared the splice junctions that each package identified in RNA-Seq data to splice junctions inferred from ESTs aligned to A. thaliana and V. vinifera. Each value is the number of splice junctions from spliced alignments of RNA-Seq data that were also identified from alignments of ESTs (excluding junctions that are annotated in TAIR9). Despite filtering out false positives, SpliceGrapher achieves higher recall than TopHat and only slightly lower recall than Supersplat.
Figure 5. SpliceGrapher prediction from RNA-Seq and EST data. This example shows how SpliceGrapher can use both RNA-Seq data and EST data to produce predictions that incorporate the strengths of each data type. RNA-Seq data provide evidence for two novel splice junctions (fourth panel down, highlighted in green) that SpliceGrapher uses to infer an alternative 3' splicing event. EST alignments provide compelling evidence for an intron retention event. SpliceGrapher combines these predictions into the final predicted graph.