Bogdan Mirauta (1), Pierre Nicolas, Hugues Richard.
(1) Biologie Computationnelle et Quantitative, UPMC and CNRS UMR7238, Paris, France and Mathématique Informatique et Génome, INRA UR1077, Jouy-en-Josas, France.
Abstract
MOTIVATION: The most common RNA-Seq strategy consists of random shearing, amplification and high-throughput sequencing of the RNA fraction. Methods to analyze transcription level variations along the genome from the read count profiles generated by the RNA-Seq protocol are needed.
RESULTS: We developed a statistical approach to estimate the local transcription levels and to identify transcript borders. This transcriptional landscape reconstruction relies on a state-space model to describe transcription level variations in terms of abrupt shifts and more progressive drifts. A new emission model is introduced to capture not only the read count variance inside a transcript but also its short-range autocorrelation and the fraction of positions with zero counts. The estimation relies on a particle Gibbs algorithm whose running time makes it more suited to microbial genomes. The approach outperformed read-overlapping strategies on synthetic and real microbial datasets.
AVAILABILITY: A program named Parseq is available at: http://www.lgm.upmc.fr/parseq/.
CONTACT: bodgan.mirauta@upmc.fr
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
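ILLUSTRATION: To make the modelling idea concrete, the following is a minimal simulation sketch of a transcription level that evolves along the genome through rare abrupt shifts and small progressive drifts, with overdispersed read counts and an excess of zero-count positions. It is not the Parseq model or its code: the function name `simulate_profile`, every parameter and default value, the log-scale random walk and the zero-inflated negative binomial emission are assumptions chosen for illustration, and the short-range autocorrelation of counts mentioned above is deliberately left out.

```python
import numpy as np


def simulate_profile(n_positions=2000, p_shift=0.005, drift_sd=0.02,
                     dispersion=2.0, p_zero=0.1, seed=0):
    """Simulate a toy transcription-level trajectory and its read counts.

    All names and default values are illustrative assumptions,
    not quantities taken from Parseq.
    """
    rng = np.random.default_rng(seed)

    # Hidden state: log transcription level at each genome position.
    log_level = np.empty(n_positions)
    current = rng.normal(2.0, 1.0)
    for t in range(n_positions):
        if rng.random() < p_shift:
            # Abrupt shift: redraw the level (e.g. at a transcript border).
            current = rng.normal(2.0, 1.0)
        else:
            # Progressive drift: small random-walk step on the log scale.
            current += rng.normal(0.0, drift_sd)
        log_level[t] = current
    level = np.exp(log_level)

    # Emission: negative binomial counts with mean equal to `level`.
    # NumPy's parameterization has mean n * (1 - p) / p, so p = n / (n + mean).
    p = dispersion / (dispersion + level)
    counts = rng.negative_binomial(dispersion, p)

    # Zero inflation: force a random fraction of positions to zero counts.
    counts[rng.random(n_positions) < p_zero] = 0
    return level, counts


if __name__ == "__main__":
    level, counts = simulate_profile()
    print("fraction of zero-count positions:", np.mean(counts == 0))
```

Recovering the hidden levels and transcript borders from such counts is the inference problem that the abstract addresses with a particle Gibbs algorithm; this sketch covers only the generative direction.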