Axel Rasche, Matthias Lienhard, Marie-Laure Yaspo, Hans Lehrach, Ralf Herwig.
Abstract
The computational prediction of alternative splicing from high-throughput sequencing data is inherently difficult and requires robust statistical measures, because the differential splicing signal is overlaid by confounding factors such as gene expression differences and the simultaneous expression of multiple isoforms, among others. In this work we describe ARH-seq, a discovery tool for differential splicing in case-control studies that is based on the information-theoretic concept of entropy. ARH-seq works on high-throughput sequencing data and is an extension of the ARH method that was originally developed for exon microarrays. We show that the method has inherent features, such as independence of transcript exon number and independence of differential expression, which make it particularly suited for detecting alternative splicing events from sequencing data. To test and validate our workflow, we challenged it with publicly available sequencing data derived from human tissues and compared it with eight alternative computational methods. To judge the performance of the different methods, we constructed a benchmark data set of true positive splicing events across different tissues, aggregated from public databases, and show that ARH-seq is an accurate, computationally fast and high-performing method for detecting differential splicing events.
Year: 2014 PMID: 24920826 PMCID: PMC4132698 DOI: 10.1093/nar/gku495
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Tissue data sets and true positive splicing events
| Tissue | Mapping level | Illumina 32 | Illumina 75 | Illumina 50f | Affymetrix exon array | AEdb TP |
|---|---|---|---|---|---|---|
| Heart | Exon | 26 497 133 | 90 096 333 | 152 095 841 | x | 13 |
| | Junction | 1 792 700 | 16 524 040 | 15 970 634 | | |
| Liver | Exon | 33 764 463 | 111 467 996 | 147 140 423 | x | 40 |
| | Junction | 2 974 503 | 30 096 758 | 16 596 701 | | |
| Testis | Exon | 49 450 100 | 112 055 446 | 117 648 767 | x | 62 |
| | Junction | 4 610 852 | 27 751 282 | 13 021 071 | | |
| Breast | Exon | na | 98 295 674 | 160 271 222 | x | x |
| | Junction | na | 19 416 413 | 18 377 157 | | |
| Kidney | Exon | na | 102 936 196 | 159 845 509 | x | 35 |
| | Junction | na | 18 726 028 | 17 260 878 | | |
| Prostate | Exon | na | 137 440 809 | 103 076 631 | x | 11 |
| | Junction | na | 27 591 808 | 11 317 574 | | |
| Thyroid | Exon | na | 106 721 876 | 143 112 126 | x | 10 |
| | Junction | na | 26 471 132 | 25 146 417 | | |
| Brain | Exon | 25 394 892 | 75 843 985 | 145 012 531 | na | 148 |
| | Junction | 1 792 406 | 14 036 937 | 16 148 859 | | |
| Skeletal muscle | Exon | 35 937 904 | 111 017 485 | 113 197 592 | na | 15 |
| | Junction | 2 857 497 | 26 281 433 | 12 503 218 | | |
| Lung | Exon | 29 783 930 | 128 222 626 | 123 907 913 | na | 22 |
| | Junction | 2 330 613 | 26 039 119 | 13 785 738 | | |
| Colon | Exon | 50 295 406 | 117 311 390 | 129 051 270 | na | 13 |
| | Junction | 3 467 812 | 20 682 248 | 17 143 662 | | |
| Adipose | Exon | na | 103 017 625 | 131 804 299 | na | x |
| | Junction | na | 21 068 190 | 18 053 743 | | |
| Adrenal | Exon | na | 97 252 603 | 133 042 562 | na | x |
| | Junction | na | 19 144 480 | 18 230 504 | | |
| Lymph node | Exon | na | 124 624 185 | 115 655 278 | na | x |
| | Junction | na | 24 067 126 | 12 011 253 | | |
| Ovary | Exon | na | 118 850 487 | 121 800 678 | na | x |
| | Junction | na | 24 972 481 | 12 456 813 | | |
| Blood | Exon | na | 135 567 404 | 144 373 657 | na | 18 |
| | Junction | na | 29 120 343 | 13 847 002 | | |
| Cerebellum | Exon | na | na | na | x | x |
| | Junction | na | na | na | | |
| Muscle | Exon | na | na | na | x | 36 |
| | Junction | na | na | na | | |
| Pancreas | Exon | na | na | na | x | x |
| | Junction | na | na | na | | |
| Spleen | Exon | na | na | na | x | 20 |
| | Junction | na | na | na | | |
| No. of pairwise tissue comparisons | | 21 | 45 | 45 | 28 | x |
| No. of tissue specific comparisons | | 4 | 7 | 7 | 8 | |
Tissue data sets used in performance assessment. ‘Illumina 32’ is the union of two published data sets utilizing 32 bp sequencing reads. ‘Illumina 75’ and ‘Illumina 50f’ correspond to the HiSeq BodyMap V2.0 data set with 75 bp and 50 bp (forward) reads, respectively. Affymetrix exon array tissue data are available from the Affymetrix homepage. The rightmost column shows the number of true positive splicing events from AEdb. The last two rows give the number of pairwise-tissue and tissue-specific test cases that could be generated for the data sets. Abbreviations: TP, true positive splicing events; pw, pairwise; ts, tissue specific; jctn, junction; na, x, no data available.
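As a quick arithmetic check on the pairwise counts, the seven tissues with ‘Illumina 32’ read data in the table yield 21 unordered tissue pairs. The sketch below only enumerates those pairs; the tissue list is copied from the table and the variable names are illustrative. The counts for the other data sets depend on which tissues entered each comparison, which is not spelled out here, so they are not reproduced.

```python
from itertools import combinations

# Tissues that have 'Illumina 32' read counts in the table above.
illumina32_tissues = [
    "heart", "liver", "testis", "brain",
    "skeletal muscle", "lung", "colon",
]

# Every unordered pair of tissues is one pairwise test case.
pairwise_cases = list(combinations(illumina32_tissues, 2))

print(len(pairwise_cases))  # 21, as listed for 'Illumina 32'
```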
Figure 1. Impact of alignment and read counting. ROC curves for differential splicing prediction comparing different junction alignment variants with respect to AEdb-confirmed splicing events. Junction expression was computed with TopHat (marked ‘_tophat’), MapSplice (marked ‘_MapSplice’), SpliceMap (marked ‘_SpliceMap’) and ‘synthetic’ junction windows (marked ‘_jctnWindowsBowtie’). Identified splice sites were mapped to Ensembl-annotated genes. ARH-seq predictions based solely on junction expression (marked ‘_jctn_’), solely on exon expression (marked ‘_exon_’) and on the combination of both (combi-counts, marked ‘_combi_’) were compared. The left plot shows averaged pairwise tissue evaluations and the right plot shows the evaluation of the brain versus liver scenario with the ‘Illumina 75’ data set.
Figure 2. ARH-seq characteristics. (A) ARH-seq prediction performance for pairwise tissue comparisons on data sets generated with different sequence read lengths. (B) ARH-seq predictions (y-axis) versus gene expression changes (log2-scale; x-axis) in the brain versus liver comparison. (C) ARH-seq predictions (y-axis) versus gene exon number (x-axis). All genes with the same exon number were grouped, and the corresponding box plots of ARH-seq values for brain versus liver are shown. (D) Distribution of ARH-seq values plotted for all sequencing data sets. The resulting Weibull fit is superimposed as a dashed line.
Figure 3. Methods comparison. (A) ROC curves for differential splicing prediction methods using the ‘Illumina 75’ data set with all possible pairwise test cases (i.e. comparing one tissue against another tissue). (B) ROC curves assessing tissue-specific splicing events (i.e. comparing one tissue against all others). Due to highly variable sample sizes, two methods had to be skipped. (C) ROC curves assessing differential splicing in brain versus liver. (D) Example of a detected true positive splicing event in the gene MPZL1. Exons are shown on the x-axis. RPKM values are shown as a red dashed line for brain and a blue solid line for liver. The splicing probabilities used for the entropy-based prediction are shown as grey bars. Two exons known to be spliced are marked with green dot-dashed lines. (E) AUC values for the different test cases, including exon array results (pw = pairwise; ts = tissue specific; b2l = brain versus liver; EA = exon array data).
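The ROC and AUC evaluation referred to in these panels can be illustrated with a generic sketch. It assumes each gene has a prediction score and a 0/1 label saying whether it carries a benchmark true positive splicing event; the function names and the toy data are hypothetical and this is not the evaluation code used in the study.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the score threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Genes with tied scores cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / negatives, tp / positives))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy example: four genes, two of them benchmark positives.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
print(auc(roc_points(toy_scores, toy_labels)))  # 0.75
```

An AUC of 0.5 corresponds to a random ranking and 1.0 to a ranking that places every benchmark event ahead of every other gene.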
Figure 4. ARH-seq differential splicing prediction workflow. The proposed workflow starts with a set of sequencing reads and an Ensembl genome annotation and generates a set of spliced genes ordered by ARH-seq prediction score. Reads are aligned to the genome with Bowtie, and counts are generated for exons and junction windows. Gene expression and combi-counts are calculated from RPKM-scaled values. Splicing prediction is performed with ARH-seq on the combi-counts. Spliced exons are judged by their splicing deviation. Finally, results are filtered by splicing strength and expression significance. Abbreviations: jctn, junction; nb, neighbouring.
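The scoring idea behind the workflow can be sketched as follows. This is a minimal illustration, not the published ARH-seq implementation: it assumes that a per-exon splicing deviation is the exon's log2 fold change minus the gene's median log2 fold change, that deviations are turned into splicing probabilities by normalising 2^|deviation|, and that the gene score is the gap between the maximum possible entropy and the observed entropy. The function names, the pseudocount and the omission of any further scaling or filtering steps are assumptions made for the example.

```python
import math
from statistics import median


def rpkm(count, length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return count * 1e9 / (length_bp * total_mapped_reads)


def entropy_splicing_score(case, control, pseudocount=1.0):
    """Illustrative entropy-based splicing score for one gene.

    case, control: per-exon expression values (e.g. RPKM), same exon order.
    Returns (score, deviations, probabilities).
    """
    if len(case) != len(control) or len(case) < 2:
        raise ValueError("need two equally long lists with at least 2 exons")
    # Per-exon log2 fold change; the pseudocount (an assumption) avoids log(0).
    log_ratios = [math.log2((a + pseudocount) / (b + pseudocount))
                  for a, b in zip(case, control)]
    # Centring on the median removes a uniform gene expression change.
    centre = median(log_ratios)
    deviations = [r - centre for r in log_ratios]
    # Larger absolute deviations receive larger splicing probabilities.
    weights = [2 ** abs(d) for d in deviations]
    total = sum(weights)
    probabilities = [w / total for w in weights]
    entropy = -sum(p * math.log2(p) for p in probabilities)
    # Subtracting from the maximum entropy makes the score comparable
    # across genes with different exon numbers.
    score = math.log2(len(probabilities)) - entropy
    return score, deviations, probabilities


# A gene that is uniformly up-regulated scores (near) zero ...
print(entropy_splicing_score([20, 20, 20, 20], [10, 10, 10, 10])[0])
# ... whereas a gene with one skipped exon scores above zero.
print(entropy_splicing_score([10, 10, 0, 10], [10, 10, 10, 10])[0])
```

In this sketch, subtracting the median fold change cancels a uniform expression shift, and measuring entropy against log2 of the exon count puts genes with different exon numbers on a common scale, which mirrors the two independence properties named in the abstract.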