Etienne Kornobis1, Luis Cabellos2, Fernando Aguilar2, Cristina Frías-López3, Julio Rozas3, Jesús Marco2, Rafael Zardoya1.
Abstract
The application of next-generation sequencing (NGS) methods to transcriptome analysis (RNA-seq) has become increasingly accessible in recent years and is of great interest to many biological disciplines, including evolutionary biology, ecology, biomedicine, and computational biology. Although virtually any research group can now obtain RNA-seq data, only a few have the bioinformatics knowledge and computing facilities required for transcriptome analysis. Here, we present TRUFA (TRanscriptome User-Friendly Analysis), an open informatics platform offering a web-based interface that generates the outputs commonly used in de novo RNA-seq analysis and comparative transcriptomics. TRUFA provides a comprehensive service that lets users dynamically perform raw read cleaning, transcript assembly, annotation, and expression quantification. Because such analyses are computationally intensive, TRUFA is highly parallelized and benefits from access to high-performance computing resources. The complete TRUFA pipeline was validated using four previously published transcriptomic data sets. For these example data sets, TRUFA produced results broadly similar to those of the original studies and performed notably better on the green tea data set. The platform permits analyzing RNA-seq data in a fast, robust, and user-friendly manner. Accounts on TRUFA are provided freely upon request at https://trufa.ifca.es.
Keywords: RNA-seq; annotation; de novo assembly; expression quantification; read cleaning; transcriptomics
Year: 2015 PMID: 26056424 PMCID: PMC4444131 DOI: 10.4137/EBO.S23873
Source DB: PubMed Journal: Evol Bioinform Online ISSN: 1176-9343 Impact factor: 1.625
Figure 1. Overview of the TRUFA pipeline.
Figure 2. Snapshot of the TRUFA web page for running RNA-seq analysis.
List of available software on TRUFA.
| RNA-SEQ STEPS | AVAILABLE PROGRAMS | VERSIONS |
|---|---|---|
| Read cleaning | PRINSEQ | 0.20.3 |
| | CUTADAPT | 1.3 |
| | BLAT | v.35 |
| Assembly and mapping | Trinity | r2012-06-08 |
| | CD-HIT | 4.5.4 |
| | CEGMA | 2.4 |
| | Bowtie | 0.12.8 |
| | Bowtie2 | 2.0.2 |
| Annotation | BLAT | v.35 |
| | HMMER | 3.0 |
| | Blast+ | 2.2.28 |
| | Blast2GO | 2.5.0 |
| Expression quantification | RSEM | 1.2.8 |
| | eXpress | 1.5.1 |
Comparison of outputs between original and TRUFA analyses.
| Pipeline | TRUFA | Haas et al (2013) | TRUFA | Zhao et al (2011) | TRUFA | Xie et al (2014) | TRUFA | Zhao et al (2011) |
|---|---|---|---|---|---|---|---|---|
| Read type | PE SS | PE SS | PE | PE | PE | PE | PE | PE |
| No. of raw bases | 544M | 544M | 2,320M | 2,320M | 5,983M | 5,983M | 24,740M | 24,740M |
| No. of bases after cleaning | No cleaning | No cleaning | 2,017M | NA | 5,342M | NA | 5,028M | NA |
| No. of transcripts | 9,370 | 9,299 | 201,892 | 188,950 | 166,512 | 170,880 | 80,999 | 70,906 |
| Mean transcript length | 1,014 | NA | 319 | 332 | 480 | 552 | 847 | 751 |
| No. of bases in the assembly | 9M | NA | 64M | 63M | 80M | 94M | 69M | 53M |
| N50 | 1,585 | 1,585 | 542 | 525 | 1,205 | 1,392 | 2,960 | 2,499 |
| No. of transcripts >1000 nt | 3,680 | NA | 13,276 | 12,495 | 22,317 | 28,578 | 17,251 | 12,511 |
| Total alignment rate | 94.98% | 99.93% | 88.84% | 61.04% | 94.76% | NA | 92.39% | 89.9% |
| Concordant pairs | 92.21% | 93.12% | 74.45% | NA | 87.51% | NA | 84.73% | NA |
Note: Concordant pairs are considered when they report at least one alignment.
Abbreviations: PE, Paired-end; SS, strand-specific; M, million; NA, data not available.
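The assembly metrics in the table above (number of transcripts, mean transcript length, bases in the assembly, N50, and transcripts longer than 1000 nt) can all be derived from a list of transcript lengths. The following is a minimal sketch, not TRUFA's own code: the function name `assembly_stats` and the toy lengths are hypothetical, and N50 is computed with the common definition (the length of the shortest transcript in the set of longest transcripts that together cover at least half of the assembled bases).

```python
def assembly_stats(lengths):
    """Summarize an assembly from its transcript lengths (in nucleotides)."""
    if not lengths:
        raise ValueError("lengths must not be empty")

    # Walk transcripts from longest to shortest until half of all bases are covered.
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    return {
        "n_transcripts": len(ordered),
        "total_bases": total,
        "mean_length": total / len(ordered),
        "n50": n50,
        "n_over_1000": sum(1 for x in ordered if x > 1000),
    }


# Toy input, invented for illustration only (not data from the paper).
toy_lengths = [200, 300, 500, 1200, 1800]
stats = assembly_stats(toy_lengths)
print(stats)
```

In practice the lengths would be read from the assembled FASTA file; that parsing step is left out here to keep the sketch self-contained.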
Summary of the de novo annotation step for the four assembled transcriptomes.
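| Data set (original study) | Haas et al (2013) | Zhao et al (2011) | Xie et al (2014) | Zhao et al (2011) |
|---|---|---|---|---|
| # transcripts | 9,370 | 201,892 | 166,512 | 80,999 |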
| # Blast Hits | 8,257 | 72,559 | 66,129 | 29,924 |
| # Annotations | 3,922 | 51,272 | 50,721 | 22,534 |
| % of annotated transcripts | 42% | 25% | 30% | 28% |
| # HMMER hits | 5,588 | 34,689 | 28,736 | 16,552 |
| User time | 11 h | 3 d 19 h | 6 d 8 h | 4 d 15 h |
Notes: # transcripts, number of transcripts assembled by Trinity; # Blast hits, number of transcripts with at least one hit against the NCBI nr database (e-value <10⁻⁶); # Annotations, number of transcripts with at least one annotation after Blast2GO analysis; # HMMER hits, number of transcripts with at least one hit against the Pfam A database (e-value <10⁻⁶); User time, time needed to perform the complete pipeline (cleaning, assembly, annotation, and expression quantification).
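The "% of annotated transcripts" row follows directly from the counts in the same table: it is the number of annotated transcripts divided by the number of assembled transcripts. As a quick arithmetic check, the sketch below recomputes that row from the reported counts; the variable names are illustrative, and rounding to the nearest whole percent is an assumption about how the table values were produced.

```python
# Counts copied from the annotation summary table, one entry per data set.
transcripts = [9370, 201892, 166512, 80999]
annotations = [3922, 51272, 50721, 22534]

# Percentage of assembled transcripts with at least one Blast2GO annotation,
# rounded to the nearest whole percent (assumed rounding).
percent_annotated = [
    round(100 * annotated / total)
    for annotated, total in zip(annotations, transcripts)
]
print(percent_annotated)  # [42, 25, 30, 28]
```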
Figure 3. Measures of completeness and read usage for the assemblies produced with TRUFA. CEGMA results represent the percentage of completely and partially recovered genes in the assemblies for a subset of 248 highly conserved core eukaryotic genes. Overall alignment rate and concordant pairs (providing at least one alignment) were computed with Bowtie2.