Dmitry Velmeshev1,2, Patrick Lally3,4, Marco Magistri5, Mohammad Ali Faghihi6.
Abstract
BACKGROUND: Next generation sequencing (NGS) technologies are indispensable for molecular biology research, but data analysis represents the bottleneck in their application. Users need to be familiar with computer terminal commands, the Linux environment, and various software tools and scripts. Analysis workflows have to be optimized and experimentally validated to extract biologically meaningful data. Moreover, as larger datasets are being generated, their analysis requires use of high-performance servers.
Year: 2016 PMID: 26758513 PMCID: PMC4710974 DOI: 10.1186/s12864-015-2346-y
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 CANEapp and the graphical user interface. a General structure of CANEapp. The Java application component is the only user-accessible component and runs on a personal computer to provide a point-and-click interface for configuring RNA-seq analysis. The interface connects either to an Amazon Cloud instance (1) created from the preconfigured CANEapp Amazon Machine Image (AMI) or to a Unix server, in which case the server-side pipeline components are automatically transferred to the server through the GUI. After a project is configured, the GUI communicates with the server side to transfer the raw data files and the options file and to initiate the analysis. b Design of CANEapp’s graphical user interface. c Capabilities of the CANEapp GUI and project design steps. The Manage Projects tab allows creating, deleting, or loading projects from a file; the user can also see the status of the selected project on this tab. The next two tabs allow adding experimental groups and samples. On the Add Samples tab the user can specify the library preparation used before sequencing and define parameters such as single- or paired-end sequencing, strand selection, and adapter sequences. The Analysis Settings tab is used to set parameters for the individual analysis steps, such as alignment, reconstruction, and differential expression analysis. Finally, the last tab is used to specify the server address and user credentials and to initiate the analysis on the server side.
Fig. 2 Server-side RNA-seq analysis pipeline. a Installation and configuration. First, the GUI transfers the pipeline scripts to the server, or uses pre-installed scripts if an Amazon Cloud instance is being used. The pipeline then detects installed software and downloads and installs all analysis tools required for the workflow, using an update file on our website that is linked to the current version of CANEapp. Next, the pipeline downloads the required reference files from ENSEMBL. Reference indexes for STAR and TopHat, as well as gene classification files, are prepared in the following step. b Parallel alignment and reconstruction module. Samples are analyzed in parallel: the reads first go through an optional trimming step and are then aligned to the genome with either TopHat or STAR. Aligned reads are used to reconstruct transcripts with Cufflinks. This module includes a resource monitor that optimally distributes available resources between subprocesses. c Transcript filtering and classification module. The ENSEMBL reference is used to classify genes generated by combining transcript files from all samples. The transcripts are then filtered to remove potentially spurious single-exon transcripts, and unannotated transcripts and loci are analyzed to predict their ability to code for proteins. d Gene expression and results formatting module. Cuffdiff, edgeR, and DESeq2 are used to quantify gene expression and identify differentially expressed genes. The pipeline converts output files into fully annotated tab-delimited files, as well as GTF files containing differentially expressed genes. The module also contains primer design scripts that automate primer design for qRT-PCR validation of gene expression.
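The caption for panel b mentions a resource monitor that distributes available resources between subprocesses, but does not describe how it does so. As a purely illustrative sketch, the following assumes the simplest possible policy, an even split of CPU cores across the samples that run concurrently. The function name `allocate_threads` and the even-split rule are assumptions made for illustration and are not taken from CANEapp.

```python
def allocate_threads(total_cores: int, n_samples: int) -> list[int]:
    """Split CPU cores evenly across samples that run at the same time.

    Hypothetical policy for illustration only; the actual CANEapp
    resource monitor is not described in the source text.
    """
    if total_cores < 1 or n_samples < 1:
        raise ValueError("total_cores and n_samples must both be >= 1")

    # Never run more samples at once than there are cores.
    n_parallel = min(total_cores, n_samples)

    # Give every concurrent sample the same base share, then hand the
    # leftover cores out one at a time to the first few samples.
    base, extra = divmod(total_cores, n_parallel)
    return [base + 1 if i < extra else base for i in range(n_parallel)]


if __name__ == "__main__":
    # Example: 16 cores shared by 7 samples.
    print(allocate_threads(16, 7))
```

Each returned value could then be passed as the thread count of the aligner or assembler subprocess for one sample; samples beyond the returned list would wait until a slot frees up.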
List of software packages and scripts used in CANEapp
| Software name | Function | CANE module |
|---|---|---|
| SRA tools | FASTQ extraction from the SRA file format | Alignment and reconstruction |
| TopHat | Read alignment | Alignment and reconstruction |
| STAR | Read alignment | Alignment and reconstruction |
| Cufflinks | Transcript reconstruction | Alignment and reconstruction |
| Cuffcompare | Merging transcripts | Transcript filtering and classification |
| Samtools | Nucleotide sequence extraction | Transcript filtering and classification |
| CNCI | Coding potential prediction | Transcript filtering and classification |
| Cuffdiff | Differential expression testing | Gene expression and results formatting |
| HTSeq | Counting reads in loci | Gene expression and results formatting |
| edgeR (R package) | Differential expression testing | Gene expression and results formatting |
| DESeq2 (R package) | Differential expression testing | Gene expression and results formatting |
| Primer 3 | Primer sequence retrieval | Primer design |
Fig. 3 Validation of gene expression changes estimated with CANEapp by quantitative real-time PCR. a RNA-seq analysis of hippocampi of Alzheimer’s disease (AD) patients and controls. Hippocampal tissue from 4 AD patients and 4 control individuals was used to extract total RNA and perform ribodepletion and strand-specific library preparation. Single-end RNA sequencing was performed on an Illumina HiSeq 2000. Fold changes of expression for 2 downregulated and 4 upregulated genes measured with real-time PCR were compared with expression values generated by CANEapp. b RNA-seq of developing mouse cortex. Tissue from 4 embryonic day 17 and 3 adult mouse cortical samples was processed to extract polyA-selected RNA and generate paired-end unidirectional sequencing data with an Illumina Genome Analyzer IIx. Gene expression estimates of 4 downregulated and 4 upregulated genes were compared between CANEapp and real-time PCR. c Fold changes of gene expression for RNA-seq of liver of rats treated with two DNA-damaging compounds. The data were produced by paired-end sequencing of polyA-selected RNA on an Illumina HiSeq 2000. Fold changes of expression for 2 downregulated and 4 upregulated genes were compared between CANEapp and real-time PCR. R2: coefficient of determination.
Comparison of CANEapp with previously developed tools for RNA-seq analysis
Note: the green checkmark signifies presence of a feature in a software tool, the red cross means absence of the feature
Description of datasets used to validate CANEapp performance to estimate differential gene expression
| Name | Organism | Experimental groups | N of samples | RNA selection protocol | Library preparation | Single or paired-end | GEO |
|---|---|---|---|---|---|---|---|
| Transcriptomic changes in hippocampi of Alzheimer’s disease patients | Homo sapiens | Alzheimer’s disease vs age- and sex-matched neurologically normal controls | 4 vs 4 | Ribo-depletion | Illumina directional small RNA prep | single | GSE67333 |
| Transcriptomic changes in embryonic and adult mouse cortex | Mus musculus | E17 cortex vs adult cortex | 4 vs 3 | Poly-A selection | Illumina mRNA-Seq prep | paired | GSE39866 |
| SEQC Rat liver toxicogenomics study | Rattus norvegicus | N-Nitrosodimethylamine, Aflatoxin B1 vs Vehicle treatments | 3 vs 3; 3 vs 4 | Poly-A selection | Illumina TruSeq RNA | paired | GSE55347 |
Fold changes of gene expression in three datasets reanalyzed by CANEapp and compared to qRT-PCR results
| Gene Name | Cuffdiff | edgeR_GLM | edgeR_et | DESeq2 | qRT-PCR |
|---|---|---|---|---|---|
| Alzheimer’s disease dataset | | | | | |
| SERPINE1 | 1.41 | 1.71 | 1.54 | 0.98 | 1.66 |
| TAC1 | −1.77 | −1.65 | −1.85 | −1.56 | −0.42 |
| ID2 | 0.98 | 1.07 | 0.88 | −0.73 | 1.17 |
| GRM2 | 0.86 | 0.98 | 0.78 | 0.63 | 0.44 |
| LINC01314 | −0.25 | −1.19 | −1.38 | −1.05 | −0.50 |
| RP11-87E22.2 | 3.63 | 2.01 | 1.85 | 1.31 | 2.30 |
| Mouse cortex dataset | | | | | |
| Vax1 | −2.12 | −2.02 | −1.74 | −1.71 | −3.18 |
| Caly | 1.93 | 1.79 | 2.08 | 2.09 | 2.10 |
| Igf2bp1 | −9.40 | −8.72 | −8.45 | −8.16 | −5.61 |
| Draxin | −5.98 | −5.26 | −4.98 | −4.94 | −6.80 |
| Nrp1 | −2.17 | −2.22 | −1.94 | −1.92 | −2.46 |
| Ttr | 11.04 | 11.18 | 11.46 | 11.28 | 11.63 |
| Mobp | 12.44 | 12.08 | 12.36 | 12.24 | 12.69 |
| Wipf1 | 2.22 | 1.74 | 2.02 | 2.03 | 1.71 |
| Rat liver dataset | | | | | |
| Bax-AFL | 1.78 | 1.86 | 1.75 | 1.96 | 2.62 |
| Cdkn1a-AFL | 4.19 | 4.28 | 3.30 | 8.00 | 23.50 |
| Myc-AFL | 0.97 | 1.03 | 0.82 | 0.99 | 2.10 |
| Met-AFL | −1.02 | −0.94 | −0.88 | −0.89 | −1.90 |
| Bax-NIT | 1.62 | 1.22 | 1.19 | 1.25 | 1.98 |
| Cdkn1a-NIT | 3.22 | 2.82 | 2.56 | 3.05 | 8.07 |
| Figf-NIT | 4.18 | 3.76 | 3.51 | 3.72 | 10.27 |
| Fzd4-NIT | −0.30 | −0.70 | −0.69 | −0.73 | −2.07 |
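
Agreement between a differential expression method and qRT-PCR in the table above can be summarized with the coefficient of determination reported in Fig. 3. The sketch below computes it as the squared Pearson correlation, which equals R2 for a simple linear fit. It uses the Cuffdiff and qRT-PCR columns of the mouse cortex rows as input. The helper name `r_squared` is hypothetical, and whether the original figure was produced in exactly this way is an assumption.

```python
def r_squared(x: list[float], y: list[float]) -> float:
    """Coefficient of determination of a simple linear fit of y on x,
    computed as the squared Pearson correlation."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (>= 2)")

    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)

    # Sums of squares and cross-products around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each contain some variation")

    return (sxy * sxy) / (sxx * syy)


# Mouse cortex rows of the table, in order:
# Vax1, Caly, Igf2bp1, Draxin, Nrp1, Ttr, Mobp, Wipf1
cuffdiff = [-2.12, 1.93, -9.40, -5.98, -2.17, 11.04, 12.44, 2.22]
qrt_pcr = [-3.18, 2.10, -5.61, -6.80, -2.46, 11.63, 12.69, 1.71]

if __name__ == "__main__":
    print(f"R2 (Cuffdiff vs qRT-PCR, mouse cortex): {r_squared(cuffdiff, qrt_pcr):.3f}")
```

The same call can be repeated with the edgeR or DESeq2 columns to compare methods on one dataset.
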
Fig. 4 Detection of novel long noncoding RNAs by CANEapp and their validation by real-time PCR. a Filtering strategies and protein-coding potential prediction. (Right) CANEapp preserves any transcripts that contain a splice junction (a) or single-exon transcripts expressed in a majority of samples (c), whereas single-exon transcripts detected in a minority of samples are filtered out (b). (Center) Loci that have insufficient read coverage are not considered for differential expression testing. (Left) To differentiate between novel noncoding RNAs and potential protein-coding genes, each isoform from a novel locus is tested for the presence of a significant open reading frame. Loci that contain at least one isoform with an open reading frame are not considered novel noncoding RNAs. b Gel electrophoresis image of PCR amplification products for experimentally validated novel long noncoding RNAs. 5 novel antisense RNAs and 3 long intergenic noncoding RNAs (lincRNAs) predicted from the human RNA-seq dataset analysis were amplified with real-time PCR. For the mouse cortex dataset, real-time PCR was performed on RNA extracted from adult mouse cortex. 3 antisense RNAs and 5 lincRNAs were successfully validated. c and d Novel long noncoding RNAs span a wide range of expression levels in human and mouse tissues. Relative expression of validated long noncoding RNAs was calculated by normalizing it to the Ct value of the endogenous control beta-actin.
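
The filtering rules in panel a can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the `Transcript` record and its field names are hypothetical, "majority of samples" is taken to mean strictly more than half, and the open reading frame test is represented by a precomputed boolean flag.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    # Hypothetical record; field names are illustrative assumptions.
    transcript_id: str
    n_exons: int
    samples_detected: int          # number of samples in which it was assembled
    has_significant_orf: bool = False


def keep_transcript(t: Transcript, n_samples: int) -> bool:
    """Apply the single-exon filter described in Fig. 4a."""
    # Any transcript with a splice junction (two or more exons) is kept.
    if t.n_exons >= 2:
        return True
    # Single-exon transcripts are kept only if detected in a majority
    # of samples (assumed here to mean strictly more than half).
    return t.samples_detected > n_samples / 2


def is_novel_noncoding_locus(isoforms: list[Transcript]) -> bool:
    """A novel locus counts as noncoding only if no isoform has a
    significant open reading frame."""
    return not any(t.has_significant_orf for t in isoforms)


if __name__ == "__main__":
    spliced = Transcript("T1", n_exons=3, samples_detected=1)
    rare_single = Transcript("T2", n_exons=1, samples_detected=2)
    common_single = Transcript("T3", n_exons=1, samples_detected=6)
    for t in (spliced, rare_single, common_single):
        print(t.transcript_id, keep_transcript(t, n_samples=8))
```
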
Primer sequences for validation of novel noncoding RNAs
| Transcript | Left primer | Right primer |
|---|---|---|
| Human Hippocampus | | |
| AS1 | ACTGGAGAAGCACGGGGA | AAGTTCCACGTGGCTGGG |
| AS2 | TCGAGCTGAGGACGTGGA | TTTCCTGCCTGGCTGGTG |
| AS3 | TCCCTGTGTGTCTGCACC | CCCACACTCAGTTCTTCCCA |
| AS4 | AGAGCGGTAGGGATACGCT | GCTGCTGATGGGTGGTCC |
| AS5 | CCATGCCTAGCCTCAGGG | CTATGTGAGCTTGGGCAAGT |
| Linc1 | CTGCCCTGTGGAGCATCC | CTCTGGCAAGGCGTTCCA |
| Linc2 | CCTGGCACCGCAGCAA | GCTGTCCTAATGCTTCATCCA |
| Linc5 | CAGGGCCCAGGATCCAGA | TGAATTACTGCCACGACCAAG |
| Mouse Cortex | | |
| AS1 | GCCCAGGCTCTCCAGAGA | ATAGTCCCTCTCCCCGCC |
| AS3 | ACGAAAGGGTGCCTTCCC | GCTTACTCCCGTCACCCC |
| AS5 | TTCTTGGACAGCGACCCC | AGCGTCAGGAAATGGCCA |
| Linc1 | TCAGGAGAAGCAGCGTGC | TCCTTCTCCAGATCTCAGGGT |
| Linc2 | TGGTCATGAACTTGTTCCTGT | GCCTGGACTCCTATGCTCA |
| Linc4 | CCAGGAACGGCTGAGACG | CTCACAGGCCAGCTGGAG |
| Linc5 | GCTGCTCCGAGCTCAGTC | TTTGGAGCGGTCCTGCAG |