Isaac A Babarinde, Yuhao Li, Andrew P Hutchins.
Abstract
The measurement of gene expression has long provided significant insight into biological functions. The development of high-throughput short-read sequencing technology has revealed transcriptional complexity at an unprecedented scale, and informed almost all areas of biology. However, as researchers have sought to gather more insights from the data, these new technologies have also increased the computational analysis burden. In this review, we describe typical computational pipelines for RNA-Seq analysis and discuss their strengths and weaknesses for the assembly, quantification and analysis of coding and non-coding RNAs. We also discuss the assembly of transposable elements into transcripts, and the difficulty these repetitive elements pose. In summary, RNA-Seq is a powerful technology that is likely to remain a key asset in the biologist's toolkit.
Keywords: Genome; Long non-coding RNA; RNA-Seq; Transcript; Transposable element
Year: 2019 PMID: 31193391 PMCID: PMC6526290 DOI: 10.1016/j.csbj.2019.04.012
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1. Typical decision lines in coding and non-coding transcript assembly. The upper part summarizes wet experimental procedures required to produce RNA-Seq reads. The lower part highlights the computational analyses and decision lines. Transcript assembly starts with the evaluation of read quality, and can proceed with or without reference annotations. Blue square boxes denote decision points on tools to use, and arrows denote strategic considerations in how to analyze the RNA-Seq data. Dotted lines indicate optional pathways. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
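The read-quality evaluation that opens the computational half of Fig. 1 can be illustrated with a minimal Python sketch. It assumes Phred+33 encoded FASTQ quality strings and an illustrative quality threshold of 20; the function names are hypothetical, and this is not how FastQC or any trimming tool is actually implemented.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    Assumes Phred+33 encoding by default (an assumption; older
    data may use a different offset).
    """
    return [ord(ch) - offset for ch in quality]


def trim_3prime(sequence, quality, min_quality=20, offset=33):
    """Drop trailing bases whose Phred score is below min_quality.

    Returns the trimmed sequence and its matching quality string.
    The threshold of 20 is illustrative, not a recommendation.
    """
    scores = phred_scores(quality, offset)
    keep = len(scores)
    # Walk back from the 3' end until a base passes the threshold.
    while keep > 0 and scores[keep - 1] < min_quality:
        keep -= 1
    return sequence[:keep], quality[:keep]


if __name__ == "__main__":
    # Hypothetical read: six high-quality bases, two low-quality ones.
    seq, qual = trim_3prime("ACGTACGT", "IIIIII##")
    print(seq, qual)
```

Dedicated trimmers typically also handle adapters and sliding-window quality, which this sketch deliberately leaves out.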
Selected tools for transcript assembly
| Process | Tool | Purpose | Input | Output |
|---|---|---|---|---|
| Read treatment | FastQC | Checks the integrity and quality of reads | FASTQ files | Quality charts |
| | FastX toolkit, Flexbar, Trimmomatic | Filters or trims reads | FASTQ files | Clean reads; reports |
| Assembly | Trinity, Trans-ABySS, Oases, SSP, IDBA-tran | Assembles reads without reference | Clean reads | Assembled transcripts |
| | TopHat, STAR, HISAT, HISAT2 with StringTie | Assembles reads with reference annotation | Clean reads, genomic reference, reference annotation | Assembled transcripts |
| Transcript classification | BEDTools, glBase | Checks overlap between coordinates | BED, GTF, GFF files | BED, GTF, GFF, report files |
| | BLAST, BLAT, GMAP, AUGUSTUS | Homology-based classification | FASTA files | Alignments, reports |
| | CPAT, FEELnc, NRC, lncRScan-SVM | Coding potential assessment | GTF or FASTA files; reference annotations (mRNA FASTA or GTF and genomic FASTA) | Coding potential scores, reports |
| Mapping | TopHat, STAR, HISAT, HISAT2, Bowtie, BWA | Aligns reads to transcripts or genes | Reads; reference annotations (GTF) | Alignments (BAM, SAM) |
| Quantification | RSEM, StringTie, bam-readcount, featureCounts | Estimates transcript abundance | Alignment files | Abundance estimates |
| | Sailfish, Salmon, Kallisto | Estimates abundance without alignment | Reads; reference annotations | Abundance estimates |
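To make the "abundance estimates" in the quantification rows concrete, the following Python sketch computes transcripts per million (TPM) from read counts and transcript lengths. It is a simplified illustration: it uses full transcript length rather than the effective length that dedicated quantifiers estimate, it ignores multi-mapping reads, and the function name and input dictionaries are hypothetical.

```python
def tpm(counts, lengths):
    """Convert per-transcript read counts to transcripts per million.

    counts  -- dict mapping transcript id to assigned read count
    lengths -- dict mapping transcript id to length in bases
    Simplification: full length is used instead of effective length.
    """
    # Length-normalised read rate for each transcript.
    rates = {tx: counts[tx] / lengths[tx] for tx in counts}
    total = sum(rates.values())
    if total == 0:
        return {tx: 0.0 for tx in counts}
    # Scale so that the values sum to one million.
    return {tx: rate / total * 1e6 for tx, rate in rates.items()}


if __name__ == "__main__":
    # Hypothetical counts and lengths for three transcripts.
    example_counts = {"tx1": 100, "tx2": 200, "tx3": 0}
    example_lengths = {"tx1": 1000, "tx2": 2000, "tx3": 500}
    for tx, value in tpm(example_counts, example_lengths).items():
        print(tx, round(value, 1))
```

In this example tx1 and tx2 receive equal TPM values because tx2 has twice the reads but is also twice as long.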
Example tools and approaches for classifying coding and non-coding transcripts.
| Approach | Instances | Example tools |
|---|---|---|
| Coordinate overlap | Known coordinates from good genome annotations | BEDTools, glBase |
| Homology based | Known sequence and reasonable databases | BLAST, BLAT, GMAP, AUGUSTUS |
| Machine learning | Characterizing features of coding and noncoding transcripts | CPAT, FEELnc, lncRScan-SVM, NRC |
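The coordinate-overlap approach in the first row can be sketched in a few lines of Python. This is a conceptual illustration only, assuming BED-style half-open coordinates (0-based start, exclusive end) and strand-aware matching; the `Interval` type, function names and example records are hypothetical and do not reproduce the behaviour of any listed tool.

```python
from collections import namedtuple

# BED-style interval: 0-based start, exclusive end (an assumption).
Interval = namedtuple("Interval", ["chrom", "start", "end", "strand"])


def overlaps(a, b, same_strand=True):
    """Return True if two intervals share at least one base."""
    if a.chrom != b.chrom:
        return False
    if same_strand and a.strand != b.strand:
        return False
    return a.start < b.end and b.start < a.end


def classify_by_overlap(transcripts, annotated_genes):
    """Map each transcript name to the annotated genes it overlaps.

    An empty list marks a transcript with no annotated overlap,
    i.e. a candidate for further classification.
    """
    return {
        name: [gene for gene, iv in annotated_genes.items() if overlaps(tx, iv)]
        for name, tx in transcripts.items()
    }


if __name__ == "__main__":
    # Hypothetical annotation and assembled transcripts.
    genes = {"geneA": Interval("chr1", 1000, 5000, "+")}
    assembled = {
        "asm1": Interval("chr1", 4500, 6000, "+"),
        "asm2": Interval("chr1", 5000, 7000, "+"),
    }
    print(classify_by_overlap(assembled, genes))
```

A linear scan like this is quadratic in the number of features; real tools index intervals so that genome-scale annotations can be queried efficiently.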
Fig. 2. Splicing of transposable elements into genes. RNA-Seq data from mouse ESCs showing the control (shLuc) or a knockdown of the RING finger domain, polycomb 1 protein RNF2 (shRnf2), that leads to the activation of expression of two genes: Nat8f2 (panel A) and Apol7b (panel B). For each genomic view, the first row shows the short-read RNA-Seq read pileup density in the control (shLuc; red) and knockdown (shRnf2; blue) experiments. The second row shows the novel splice junctions detected in the RNA-Seq data when Rnf2 is knocked down (shRnf2). Splice junctions that join an exon of Nat8f2 or Apol7b to an LTR are indicated in red; other splice junctions are indicated in grey. The third row shows the GENCODE genes at this locus. The fourth row shows the locations of the LTRs (red), SINEs (green) and LINEs (blue). LTRs that show evidence of splicing into Nat8f2 or Apol7b are labeled. Data are from GSE108091 [147]. Reads were aligned to the mm10 genome using STAR [111], with the parameters described in [147]. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Example tools for different stages of RNA-seq.
| Analysis | Conditions | When to use | Recommended read type | Useful tools | Possible pipeline |
|---|---|---|---|---|---|
| Mapping | Transcripts as reference | Reliable and near-complete annotations | Short reads | Bowtie2, STAR, HISAT | Trinity package |
| | Genome sequences as reference | Poor transcript annotation, new assembly | Long reads | GMAP, Minimap2, STAR | |
| Quantification | Good annotation | Normal expression level estimation | Short reads | RSEM, Kallisto, Salmon | |
| | Poor/no annotation | Assembly followed by quantification | Long and short reads | HISAT2/StringTie, TopHat/Cufflinks | HISAT2/StringTie, TopHat/Cufflinks |
| | Gene level quantification | Comparing genes | Short reads | RSEM, Kallisto, Salmon | RSEM |
| | Count of aligned reads | Expression level from alignments | Short reads | HTSeq, featureCounts | |
| | Transcript level quantification | Interest in isoforms | Short reads | RSEM, StringTie, TopHat | |
| | Limited computational resources | Quick estimation | Short reads | Kallisto, Salmon | Toil |
| | Repetitive element | Transposable element quantification | Long and short reads | RSEM with special parameters | LIONS |
| Assembly | Good annotation | | Long and short reads | Isoseq followed by GMAP, Minimap or STAR | |
| | Poor/no annotation | Good transcript annotation | Long reads | Isoseq followed by GMAP, Minimap or STAR | |
| | Repetitive element | Transposable element expression | Long reads | Isoseq followed by GMAP, Minimap or STAR | |
| Automated process | Sequential analyses | Limited bioinformatics skills | Long or short reads | Numerous tools | systemPipeR, VIPER, hppRNA |