Dimitra Sarantopoulou (1,2), Thomas G Brooks (1), Soumyashant Nayak (1), Antonijo Mrčela (1), Nicholas F Lahens (1), Gregory R Grant (3,4).
1. Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA.
2. National Institute on Aging, National Institutes of Health, Baltimore, MD, USA.
3. Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA. ggrant@pennmedicine.upenn.edu.
4. Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA. ggrant@pennmedicine.upenn.edu.
Abstract
BACKGROUND: Full-length isoform quantification from RNA-Seq is a key goal in transcriptomics analyses and has been an area of active development since the early days of the technology. The fundamental difficulty is that RNA transcripts are long, while RNA-Seq reads are short. RESULTS: Here we use simulated benchmarking data that reflect many properties of real data, including polymorphisms, intron signal and non-uniform coverage, allowing for systematic comparative analyses of isoform quantification accuracy and its impact on differential expression (DE) analysis. Genome alignment-based, transcriptome alignment-based and pseudo-alignment-based methods are included, along with a simple approach that serves as a baseline control. CONCLUSIONS: Salmon, kallisto, RSEM, and Cufflinks exhibit the highest accuracy on idealized data, while on more realistic data they do not perform dramatically better than the simple approach. We find that the structural parameters with the greatest impact on quantification accuracy are length and sequence compression complexity, rather than the number of isoforms. The effect of incomplete annotation on performance is also investigated. Overall, the tested methods diverge from the truth enough to suggest that full-length isoform quantification and isoform-level DE should still be employed selectively.
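The abstract names sequence compression complexity as one of the structural parameters that most affects accuracy, but does not define it here. As a rough illustration only, the sketch below treats it as the ratio of compressed size to raw size of a nucleotide sequence. The use of zlib and the function name `compression_complexity` are assumptions made for this example and are not necessarily the measure used in the study.

```python
import random
import zlib


def compression_complexity(seq: str, level: int = 9) -> float:
    """Return compressed size / raw size for a nucleotide sequence.

    Assumption: zlib (DEFLATE) stands in for whatever compressor the
    study used. Lower values indicate a more repetitive sequence.
    """
    raw = seq.upper().encode("ascii")
    if not raw:
        raise ValueError("sequence must be non-empty")
    return len(zlib.compress(raw, level)) / len(raw)


# A highly repetitive sequence versus a pseudo-random one of equal length.
repetitive = "ACGT" * 250
rng = random.Random(0)  # fixed seed so the example is reproducible
mixed = "".join(rng.choice("ACGT") for _ in range(1000))

print(f"repetitive: {compression_complexity(repetitive):.3f}")
print(f"mixed:      {compression_complexity(mixed):.3f}")
```

Under this proxy, low-complexity (repetitive) sequences score lower than mixed ones; such sequences are plausibly harder to assign short reads to unambiguously.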
Keywords:
Benchmarking; Isoform quantification; Pseudo-alignment; RNA-seq; Short reads; Simulated data