Yixing Han, Shouguo Gao, Kathrin Muegge, Wei Zhang, Bing Zhou.
Abstract
Next-generation sequencing technologies have revolutionized sequence-based research through their high throughput, high sensitivity, and high speed. RNA-seq is now widely used to uncover multiple facets of the transcriptome and to facilitate biological applications. However, the large-scale data analyses associated with RNA-seq pose challenges. In this study, we present a detailed overview of the applications of this technology and the challenges that need to be addressed, including data preprocessing, differential gene expression analysis, alternative splicing analysis, variant detection and allele-specific expression, pathway analysis, co-expression network analysis, and applications that combine various experimental procedures, beyond the achievements that have been made. Specifically, we discuss the essential principles of the computational methods required to meet the key challenges of RNA-seq data analysis, the development of various bioinformatics tools, challenges associated with RNA-seq applications, and examples that represent the advances made so far in the characterization of the transcriptome.
Keywords: RNA-seq; alternative splicing; co-expression network; data preprocessing; differential gene expression; pathway analysis; systems biology; variants detection
Year: 2015 PMID: 26609224 PMCID: PMC4648566 DOI: 10.4137/BBI.S28991
Source DB: PubMed Journal: Bioinform Biol Insights ISSN: 1177-9322
Figure 1. Overview of the typical RNA-seq pipeline. Three main sections are presented: Experimental Biology, Computational Biology, and Systems Biology. The pipeline starts with experimental preparation and proceeds through the sequencing and analysis steps, with arrows pointing from each step to the next.
Overview of technical specifications of next-generation sequencing platforms.*
| PLATFORM | ILLUMINA GAIIx | ILLUMINA HiSeq 2000 | ILLUMINA MiSeq V2 | SOLiD–5500xl | 454 GS FLX+ | ION TORRENT PGM | PacBio RS |
|---|---|---|---|---|---|---|---|
| Chemistry principle | Sequencing-by-synthesis | Sequencing-by-synthesis | Sequencing-by-synthesis | Ligation and two-base coding | Pyrosequencing | Proton detection | Real-time sequencing |
| Instrument price | $256 K | $654 K | $128 K | $251 K | $450 K | $80 K (System price including PGM, server, OneTouch and OneTouch ES.) | $695 K |
| Sequence yield per run | 30Gb | 600Gb | 1.5–2Gb | 150Gb | 0.7Gb | 50 Mb on 314 chip, 400 Mb on 316 chip, 1.5Gb on 318 chip | 100 Mb |
| Sequence cost per GB | $148 | $45 | $502 | $67.00 | $50 | $800 (318 chip) | $2,000 |
| Reagent cost per run | $17,575 | $23,470 | $1,070 | $10,503 | $4,842 | $349 on 314 chip, $549 on 316 chip, $749 on 318 chip | ≥$300 |
| Reagent cost per MB | $0.19 | >$0.04 | $0.14 | <$0.07 | $7 | $5 on 314 chip, $1.2 on 316 chip, $0.6 on 318 chip | $2–17 |
| Run time | 10 days | 11 days | 27 hours | 7 days for SE, 14 days for PE | 20 hours | 2–5 hours | 2 hours |
| Observed raw error rate | 0.76% | 0.26% | 0.80% | <0.1% | 1% | ∼1% | ∼10% |
| Read length | Up to 150 bases | Up to 150 bases | Up to 150 bases | 85 bases | 700 bases | ∼200 bases | 3000 bases, up to 15000 bases |
| Read type | PE | PE | PE | PE | SR | PE | SR |
| Insert size | Up to 700 bases | Up to 700 bases | Up to 700 bases | 300 bases | Up to 40 kb | Up to 250 bases | Up to 10 kb |
| Typical DNA amount requirement | 50–1000 ng | 50–1000 ng | 50–1000 ng | 400–4000 ng | 25–1000 ng | 100–1000 ng | ∼1 µg |
| Computation resources | $222 cluster | $222 cluster | Desktop/cloud | $35 cluster | $5 (desktop) | $16.5 (desktop) | $65 cluster |
| Data file sizes (GB) | 600 | <600 | 1 | 148 | 40 images, 8 sff | 0.1 sff, 0.2 fastq on 314 chip, 5 sff, 1 fastq on 316 chip, 10 sff, 2.5 fastq on 318 chip | 2 (basecalls, QV, kinetics) |
Notes:
Information is based on company sources alone; data updated to 2013–2014.
Costs count only the cost per run and do not include general-purpose and library preparation equipment, annual maintenance agreements, or extra services.
The new compressed binary data format stores base and quality-value data at a ratio of 1 byte per base.
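
As an illustration of what a 1 byte per base encoding can look like, the sketch below packs one base call and its quality value into a single byte. The layout (2 bits for the base, 6 bits for the quality) and the function names `pack_call`, `unpack_call`, and `pack_read` are assumptions made for this example, not a description of any vendor's file format.

```python
from typing import List, Tuple

BASES = "ACGT"  # assumed 2-bit encoding: A=0, C=1, G=2, T=3


def pack_call(base: str, qual: int) -> int:
    """Pack one base call and its quality value (0-63) into one byte."""
    if base not in BASES:
        raise ValueError("base must be one of A, C, G, T")
    if not 0 <= qual <= 63:
        raise ValueError("quality must fit in 6 bits (0-63)")
    # Upper 6 bits hold the quality, lower 2 bits hold the base.
    return (qual << 2) | BASES.index(base)


def unpack_call(byte: int) -> Tuple[str, int]:
    """Recover (base, quality) from a packed byte."""
    return BASES[byte & 0b11], byte >> 2


def pack_read(seq: str, quals: List[int]) -> bytes:
    """Pack a whole read: the output has exactly one byte per base."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists must have equal length")
    return bytes(pack_call(b, q) for b, q in zip(seq, quals))


if __name__ == "__main__":
    packed = pack_read("ACGT", [30, 20, 40, 2])
    print(len(packed), [unpack_call(b) for b in packed])
```

Compared with a text format that spends one character on the base and another on the quality, this kind of packing roughly halves the storage per base call, at the cost of limiting the quality range to 64 values.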
Selected list of packages and tools for RNA-seq data analysis.
| ANALYSIS STEP | PACKAGE | DESCRIPTION AND COMMENTS | REFERENCES |
|---|---|---|---|
| Quality assessment and preprocessing | FastQC | A sequencing quality evaluator that is easy to use and reports read quality graphically. | |
| | HTQC | A toolkit that includes statistics tools for Illumina high-throughput sequencing data and filtration tools for sequence quality, length, and tail quality. It depicts base calling and evaluates base quality position by position, along with overall read features. | |
| | Trimmomatic | Performs a variety of useful trimming tasks for Illumina paired-end and single-end data. It removes PCR primers and adapter sequences, scans every read with a 4-base sliding window, and trims low-scoring bases and low-quality N bases to improve read quality. Flexible; can handle paired-end data. | |
| | BBMap | A short-read aligner for DNA and RNA-seq data. Capable of handling arbitrarily large genomes with millions of scaffolds. Handles Illumina, PacBio, 454, and other reads; very sensitive and tolerant of errors and numerous large indels. Very fast. Includes BBMerge, which merges paired reads based on overlap to create longer reads and produces an insert-size histogram. | |
| | FLASH | Fast Length Adjustment of SHort reads. Combines overlapping paired-end reads and converts them into single long reads. | |
| | RSeQC | Provides powerful modules that comprehensively evaluate RNA-seq data after preprocessing. Basic modules quickly inspect sequence quality, nucleotide composition bias, PCR bias, and GC bias, while RNA-seq-specific modules evaluate sequencing saturation, mapped read distribution, coverage uniformity, strand specificity, etc. | |
| Mapping | ELAND | The first short-read aligner, though no longer the fastest. ELAND substantially influenced many aligners in this category and still outperforms many of its followers. ELAND itself works only for 32 bp single-end reads; additional Perl scripts in GAPipeline extend its capabilities. | |
| | SOAP | A program for efficient gapped and ungapped alignment of short oligonucleotides onto reference sequences. SOAP is compatible with numerous applications, including single-read or paired-end resequencing, small RNA discovery, and mRNA tag sequence mapping. SOAP is a command-driven program that supports multithreaded parallel computing and has a batch module for multiple query sets. | |
| | SOAP2 | An updated version of SOAP for short-read alignment. Very fast and accurate alignment of huge numbers of short reads; includes a single-individual genotype caller (SOAPsnp, SOAPsnv, SOAPindel). | |
| | MAQ | A program to align short reads and call variants. Features include PET mapping, quality-aware alignment, gapped alignment for PET, mapping quality, adapter trimming, partial occurrence counting, and an SNP caller. | |
| | Bowtie | An ultrafast, memory-efficient short-read aligner. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small. A useful unspliced aligner. | |
| | BWA | A software package for mapping low-divergence sequences against a large reference genome. It consists of three algorithms: BWA-backtrack, BWA-SW, and BWA-MEM. The first is designed for Illumina reads of up to 100 bp, while the other two handle longer sequences from 70 bp to 1 Mbp. | |
| | ZOOM | A framework that can map Illumina/Solexa reads at 15x coverage of a human genome to the reference human genome in one CPU-day, allowing two mismatches, at full sensitivity. | |
| | STAR | An ultrafast universal RNA-seq aligner that uses sequential maximum mappable seed search in uncompressed suffix arrays, followed by a seed clustering and stitching procedure. STAR has the potential to accurately align long (several kilobases) reads that are emerging from third-generation sequencing technologies. | |
| | BLAT | | |
| | HTSeq | A Python framework for working with high-throughput sequencing data that can perform sequencing quality evaluation and read counting. It is flexible: users can write custom scripts or use the stand-alone scripts. | |
| | easyRNASeq | A Bioconductor package for processing RNA-Seq data that performs count summarization per feature of interest and count normalization. | |
| | GenomicRanges | A Bioconductor package that defines general-purpose containers for storing genomic intervals. Specialized containers for representing and manipulating short alignments against a reference genome are defined in the GenomicAlignments package. | |
| | featureCounts | An R package suitable for counting reads generated from either RNA or genomic DNA sequencing. It implements highly efficient chromosome hashing and feature blocking techniques, so it is considerably faster than existing methods and requires far less computer memory. | |
| Expression quantification | ALEXA-Seq | A comprehensive package that includes a database for alignment, quantifies gene expression, extracts isoform features, and visualizes the results. | |
| | Cufflinks | Transcriptome assembly and differential expression analysis for RNA-Seq. It can also perform isoform quantification through maximum likelihood estimation of relative isoform expression. | |
| | RSEM | A package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. RSEM outputs abundance estimates, 95% credibility intervals, and visualization files, and can also simulate RNA-Seq data. In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without sequenced genomes. | |
| Differential expression | Cuffdiff | A robust and accurate tool for differential analysis of RNA-Seq experiments. It performs isoform-level analysis. | |
| | DESeq | An R package to analyse count data from high-throughput sequencing assays such as RNA-Seq and to test for differential expression. It supports multi-factor analysis with negative binomial GLMs. | |
| | DESeq2 | A method for differential analysis of count data that uses shrinkage estimation for dispersions and fold changes to improve the stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression. | |
| | edgeR | A Bioconductor software package for examining differential expression of replicated count data. An overdispersed Poisson model is used to account for both biological and technical variability. Empirical Bayes methods are used to moderate the degree of overdispersion across transcripts, improving the reliability of inference. The methodology can be used even with the most minimal levels of replication, provided at least one phenotype or experimental condition is replicated. The software may have other applications beyond sequencing data, such as proteome peptide count data. | |
| | PoissonSeq | A method for normalization, testing, and false discovery rate estimation for RNA-sequencing data based on a Poisson log-linear model. | |
| | limma-voom | limma is an R package for data analysis based on linear models, originally developed for differential expression analysis of microarray data. The voom function in the limma package transforms count data so that they can be modeled as approximately Gaussian, allowing significance to be tested statistically. | |
| | MISO | A probabilistic framework that quantitates the expression level of alternatively spliced genes from RNA-Seq data and identifies differentially regulated isoforms or exons across samples. | |
| Alternative splicing | TopHat | A widely used, fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra-high-throughput short-read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons. | |
| | MapSplice | An algorithm for mapping RNA-seq data to a reference genome for splice junction discovery. It uses the exon-first method and supports both single-end and paired-end reads with high memory efficiency and accuracy. | |
| | SpliceMap | A de novo splice junction discovery and alignment tool. It offers high sensitivity and accuracy and supports arbitrary RNA-seq read lengths. | |
| | SplitSeek | A program for de novo prediction of splice junctions in RNA-seq data. It uses the exon-first method. | |
| | GEM mapper | A fast, accurate, and versatile tool for alignment by filtration. It leverages string matching by filtration to search the alignment space more efficiently, delivering precision and speed simultaneously. | |
| | spliceR | An easy-to-use tool that extends the usability of RNA-seq and assembly technologies by allowing greater depth of annotation of RNA-seq data. | |
| | SplicingCompass | A method and software to predict genes that are differentially spliced between two different conditions using RNA-seq data. | |
| | GLiMMPS | A robust statistical method for detecting splicing quantitative trait loci (sQTLs) from RNA-seq data. | |
| | MATS | A computational tool to detect differential alternative splicing events from RNA-Seq data. The statistical model of MATS calculates the P-value and false discovery rate that the difference in the isoform ratio of a gene between two conditions exceeds a given user-defined threshold. From the RNA-Seq data, MATS can automatically detect and analyze alternative splicing events corresponding to all major types of alternative splicing patterns. MATS handles replicate RNA-Seq data from both paired and unpaired study designs. | |
| | rMATS | A statistical model and computer program designed for detection of differential alternative splicing from replicate RNA-Seq data. rMATS uses a hierarchical model to simultaneously account for sampling uncertainty in individual replicates and variability among replicates. | |
| Variant detection | GATK | A package for the analysis of aligned NGS data, which includes an SNP and genotype caller (Unified Genotyper), SNP filtering (Variant Filtration), and SNP quality recalibration (Variant Recalibrator). | |
| | ANNOVAR | An efficient software tool to functionally annotate genetic variants (gene-based, region-based, or filter-based) detected from diverse genomes. | |
| | SNPiR | A highly accurate approach to identify SNPs in RNA-seq data. | |
| | SNiPlay3 | A web-based application for exploration and large-scale analyses of genomic variations. | |
| Pathway analysis | GSEA | A knowledge-based approach for interpreting genome-wide expression profiles. It determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (eg, phenotypes). | |
| | GSVA | A non-parametric, unsupervised method for estimating variation of gene set enrichment across the samples of an expression data set. GSVA performs a change in coordinate systems, transforming the data from a gene-by-sample matrix to a gene-set-by-sample matrix, thereby allowing the evaluation of pathway enrichment for each sample. | |
| | SeqGSEA | Provides methods for gene set enrichment analysis of high-throughput RNA-Seq data by integrating differential expression and splicing. It uses a negative binomial distribution to model read count data, which accounts for sequencing biases and biological variation. Based on permutation tests, statistical significance can also be obtained for each gene’s differential expression and splicing, respectively. | |
| | GAGE | Generally Applicable Gene-set Enrichment, a method for gene set (pathway) analysis that is applicable to data sets with different sample sizes and experimental designs. | |
| | SPIA | An R package that uses the information from a list of differentially expressed genes and their log fold changes, together with signaling pathway topology, to identify the pathways most relevant to the condition under study. | |
| | TAPPA | A Java-based tool for identifying phenotype-associated genetic pathways using pathway topological measures. | |
| | DEAP | A tool that capitalizes on information about biological pathways to identify important regulatory patterns from differential expression data. It makes significant improvements over existing approaches by including information about pathway structure and discovering the most differentially expressed portion of the pathway. | |
| | GSAASeqSP | A toolset for gene set association analysis of RNA-Seq count data. GSAASeqSP identifies pathways/gene sets significantly associated with a disease or a phenotype by analyzing genome-wide patterns of gene expression variation measured by RNA-Seq technology. | |
| Co-expression network | GSCA | An open-source software package to help researchers use massive amounts of publicly available gene expression data (PED) to make discoveries. Users can interactively visualize and explore gene and gene set activities in 25,000+ consistently normalized human and mouse gene expression samples representing diverse biological contexts. | |
| | DICER | A method for detecting differentially co-expressed gene sets using a novel probabilistic score for differential correlation. DICER goes beyond standard differential co-expression and detects pairs of modules showing differential co-expression. | |
| | WGCNA | A powerful method to extract co-expressed groups of genes from large microarray data sets that has also been applied successfully to RNA-seq data. It is suggested to remove genes whose read counts are consistently low and to normalize the data with a variance-stabilizing transformation before calculating the pairwise similarity of expression patterns. | |
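
The preprocessing suggested for WGCNA in the last row of the table can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the WGCNA implementation: the thresholds `min_count` and `min_samples` are arbitrary, a plain `log2(count + 1)` transform stands in for a proper variance-stabilizing transformation, and absolute Pearson correlation is used as the pairwise similarity. All function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Counts = Dict[str, List[float]]  # gene name -> read counts, one per sample


def filter_low_counts(counts: Counts, min_count: int = 10,
                      min_samples: int = 2) -> Counts:
    """Drop genes that reach `min_count` reads in fewer than `min_samples` samples."""
    return {gene: values for gene, values in counts.items()
            if sum(v >= min_count for v in values) >= min_samples}


def log_transform(counts: Counts) -> Counts:
    """Simple stand-in for a variance-stabilizing transformation."""
    return {gene: [math.log2(v + 1) for v in values]
            for gene, values in counts.items()}


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def pairwise_similarity(expr: Counts) -> Dict[Tuple[str, str], float]:
    """Absolute correlation for every unordered pair of genes."""
    genes = sorted(expr)
    return {(a, b): abs(pearson(expr[a], expr[b]))
            for i, a in enumerate(genes) for b in genes[i + 1:]}


if __name__ == "__main__":
    raw = {
        "geneA": [100, 200, 400, 800],
        "geneB": [50, 100, 200, 400],
        "geneC": [0, 1, 0, 2],        # consistently low, removed by the filter
        "geneD": [900, 300, 120, 40],
    }
    similarity = pairwise_similarity(log_transform(filter_low_counts(raw)))
    for pair, value in sorted(similarity.items()):
        print(pair, round(value, 3))
```

Filtering before the transform matters because genes with near-zero counts produce noisy correlations that can dominate the similarity matrix.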
Figure 2. The STARR-seq pipeline and the corresponding ‘systems biology’ steps. Sonicated genomic DNA fragments are PCR-amplified and placed downstream of a minimal promoter in reporter vectors. The desired measurements are embedded in the genome. The reporter library is transfected into cultured cell lines, and poly-A RNAs are isolated from the pool of total RNA. These steps selectively enrich the targets of interest. After RNA-seq is performed, the reads are mapped to the reference genome and their enrichment over input is measured to reflect enhancer activity. The systems biology steps, including mathematical and computational analysis, help with interpretation.
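
The "enrichment over input" measurement mentioned in the caption can be illustrated with a small sketch. It assumes that reads have already been counted per candidate region in both the RNA-seq and input libraries, normalizes each library to counts per million, and reports a log2 ratio with a pseudocount. The function name `log2_enrichment`, the region identifiers, and the pseudocount are illustrative assumptions; a real analysis would typically add peak calling and a statistical test.

```python
import math
from typing import Dict


def log2_enrichment(rna_counts: Dict[str, int],
                    input_counts: Dict[str, int],
                    pseudocount: float = 1.0) -> Dict[str, float]:
    """log2 ratio of library-size-normalized RNA reads over input reads per region."""
    rna_total = sum(rna_counts.values())
    input_total = sum(input_counts.values())
    if rna_total == 0 or input_total == 0:
        raise ValueError("both libraries must contain at least one read")

    enrichment = {}
    for region, input_reads in input_counts.items():
        # Counts per million correct for different sequencing depths.
        rna_cpm = (rna_counts.get(region, 0) + pseudocount) / rna_total * 1e6
        input_cpm = (input_reads + pseudocount) / input_total * 1e6
        enrichment[region] = math.log2(rna_cpm / input_cpm)
    return enrichment


if __name__ == "__main__":
    rna = {"region_1": 300, "region_2": 100}
    inp = {"region_1": 100, "region_2": 300}
    for region, value in log2_enrichment(rna, inp).items():
        print(region, round(value, 2))
```

Regions with positive values are over-represented in the RNA relative to the input library, which in this assay is read as evidence of enhancer activity.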