Chen Suo, Stefano Calza, Agus Salim, Yudi Pawitan. Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden; Department of Molecular and Translational Medicine, University of Brescia, Italy; and Department of Mathematics and Statistics, La Trobe University, Australia.
Abstract
MOTIVATION: RNA-sequencing technologies provide a powerful tool for expression analysis at the gene and isoform level, but accurate estimation of isoform abundance remains a challenge. The standard assumption of uniform read intensity yields biased estimates when the read intensity is in fact non-uniform. The problem is that, without strong assumptions, the read intensity pattern is not identifiable from data observed in a single sample.

RESULTS: We develop a joint statistical model that accounts for non-uniform, isoform-specific read distributions while estimating gene isoform expression. The main challenge lies in dealing with the large number of isoform-specific read distributions, potentially as many as there are splice variants in the genome. Statistical regularization via a smoothing penalty is imposed to control the estimation. For identifiability, the method also uses information across samples from the same region. We develop a fast and robust computational procedure based on the iterated-weighted least-squares algorithm and apply it to simulated data and to two real RNA-Seq datasets with reverse transcription-polymerase chain reaction validation. Empirical tests show that our model performs better than existing methods, with increased precision in isoform-level estimation.

AVAILABILITY AND IMPLEMENTATION: We have implemented our method in an R package called Sequgio as a pipeline for fast processing of RNA-Seq data.
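The bias described in the motivation can be illustrated with a small, noise-free toy example. The sketch below is not the Sequgio implementation and does not reproduce the penalized multi-sample model; it only shows how ordinary least squares under a uniform-intensity design can misestimate isoform abundances when the true read intensity varies along the transcript. The gene layout, exon lengths, intensity values and all function names are hypothetical and chosen purely for illustration.

```python
# Toy gene with three exons and two isoforms (all numbers are hypothetical).
# Isoform A uses exons 1, 2, 3; isoform B skips exon 2.
EXON_LENGTHS = [100.0, 100.0, 100.0]
MEMBERSHIP = [(1, 1), (1, 0), (1, 1)]  # rows: exons, columns: isoforms (A, B)


def design_matrix(intensity):
    """Expected reads per unit abundance: exon length * intensity * membership.

    intensity[i][e] is the relative read intensity of isoform i on exon e.
    """
    return [
        [EXON_LENGTHS[e] * intensity[i][e] * MEMBERSHIP[e][i] for i in range(2)]
        for e in range(3)
    ]


def least_squares_2(X, y):
    """Minimise ||y - X b||^2 for two unknowns via the 2x2 normal equations."""
    a11 = sum(r[0] * r[0] for r in X)
    a12 = sum(r[0] * r[1] for r in X)
    a22 = sum(r[1] * r[1] for r in X)
    b1 = sum(r[0] * v for r, v in zip(X, y))
    b2 = sum(r[1] * v for r, v in zip(X, y))
    det = a11 * a22 - a12 * a12
    return [(a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]


# Assumed "true" intensities with a 3' bias; each isoform's intensities
# average to 1 over the exons it contains.
TRUE_INTENSITY = [[0.5, 0.8, 1.7], [0.5, 0.0, 1.5]]
UNIFORM_INTENSITY = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
TRUE_ABUNDANCE = [2.0, 1.0]

# Noise-free expected exon counts generated under the non-uniform intensity.
X_true = design_matrix(TRUE_INTENSITY)
counts = [sum(x * t for x, t in zip(row, TRUE_ABUNDANCE)) for row in X_true]

# Fit once assuming uniform intensity, once with the true intensity known.
uniform_estimate = least_squares_2(design_matrix(UNIFORM_INTENSITY), counts)
weighted_estimate = least_squares_2(X_true, counts)

print("true abundance:      ", TRUE_ABUNDANCE)
print("uniform-model fit:   ", uniform_estimate)
print("known-intensity fit: ", weighted_estimate)
```

In this toy setting the uniform-model fit assigns roughly equal abundance to both isoforms although the true ratio is 2:1, whereas the fit that uses the true intensities recovers the abundances. In practice the intensities are unknown, which is why the abstract's method estimates them jointly across samples under a smoothing penalty.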