Peng Liu, Jessica Ewald, Jose Hector Galvez, Jessica Head, Doug Crump, Guillaume Bourque, Niladri Basu, Jianguo Xia.
Abstract
Computational time and cost remain a major bottleneck for RNA-seq data analysis of nonmodel organisms without reference genomes. To address this challenge, we have developed Seq2Fun, a novel, all-in-one, ultrafast tool that performs functional quantification of RNA-seq reads directly, without de novo transcriptome assembly. The pipeline starts with quality control of the raw reads: sequencing error correction, poly(A) tail removal, and merging of overlapping paired-end reads. It then conducts a DNA-to-protein search by translating each read into all possible amino acid fragments and identifying possible homologous sequences in a well-curated protein database. Finally, the pipeline generates several informative outputs, including gene abundance tables, pathway and species hit tables, an HTML report that visualizes the results, and a set of clean reads annotated with mapped genes, ready for downstream analysis. Seq2Fun does not write or load intermediate files, which makes its I/O very efficient. Seq2Fun is written in C++ and can run on a personal computer with a limited number of CPUs and limited memory. It can process >2,000,000 reads/min and is >120 times faster than conventional workflows based on de novo assembly, while maintaining high accuracy in our various test data sets.
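The translation step described in the abstract can be illustrated with a minimal Python sketch of six-frame translation. This is not Seq2Fun's C++ implementation. The function names, the `min_len` parameter, the use of the standard genetic code, and the choice to split fragments at stop codons are all assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unrecognized codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_fragments(read, min_len=1):
    """Hypothetical helper: translate a read in all six frames, split each
    translation at stop codons, and return the fragments longest first."""
    read = read.upper()
    fragments = []
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            protein = translate(strand[offset:])
            fragments.extend(f for f in protein.split("*") if len(f) >= min_len)
    return sorted(fragments, key=len, reverse=True)
```

Calling `six_frame_fragments(read)` returns the candidate amino acid fragments for one read. In a pipeline like the one described, the longest of these would then be passed on to the protein database search.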
Year: 2021 PMID: 33731361 PMCID: PMC8015844 DOI: 10.1101/gr.269894.120
Source DB: PubMed Journal: Genome Res ISSN: 1088-9051 Impact factor: 9.043
Figure 1. Overview of the Seq2Fun workflow. Seq2Fun accepts raw RNA-seq reads and generates various expression count tables. There are three main phases: quality control, translated search, and expression quantification. Seq2Fun starts by loading a read pack (n = 10,000 raw RNA-seq reads), followed by trimming, adaptor and poly(A) tail removal, merging of overlapping paired-end reads, and sequencing error correction. Cleaned reads are then translated into all possible amino acid sequences, and the longest fragments are searched against an FM-index-based protein database to identify the most likely functional homologs, using either maximum exact match (MEM) or Greedy mode. Each matched read is assigned one or more protein IDs, each protein ID is mapped to its KEGG ortholog ID, and the counts for each KEGG ortholog are summed to produce a KEGG ortholog abundance table, pathway hit table, species hit table, and KEGG ortholog reads table. An HTML report is also generated to summarize and visualize read quality and the results tables. Cleaned reads labeled with their mapped KEGG orthologs are also retrieved.
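The quantification phase in Figure 1 (read to protein ID(s), protein ID to KEGG ortholog, then summation) can be sketched as follows. This is an illustrative sketch, not Seq2Fun's code. The function name, the input structures, and all identifiers in the usage note are hypothetical. The rule that a read contributes one count to each distinct KEGG ortholog it maps to is an assumption, since the caption does not say how multi-mapping reads are handled.

```python
from collections import Counter


def ko_abundance(read_hits, protein_to_ko):
    """Hypothetical helper: sum reads per KEGG ortholog (KO).

    read_hits:     dict mapping read ID -> list of matched protein IDs
    protein_to_ko: dict mapping protein ID -> KO ID
    Assumption: each read adds one count to every distinct KO it maps to;
    protein IDs with no KO mapping are ignored.
    """
    counts = Counter()
    for protein_ids in read_hits.values():
        kos = {protein_to_ko[p] for p in protein_ids if p in protein_to_ko}
        for ko in kos:
            counts[ko] += 1
    return counts
```

For example, with hypothetical inputs `{"r1": ["p1", "p2"], "r2": ["p3"]}` and `{"p1": "K00001", "p2": "K00001", "p3": "K00002"}`, each KEGG ortholog receives one count, because both of `r1`'s protein hits map to the same ortholog.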
Performance assessments based on simulated and real data sets
Benchmark on gene expression and pathway analysis
Figure 2. Significant pathways identified from RNA-seq data of double-crested cormorant (DCCO). The pathways are visualized by ridgeline plots. The distribution for each pathway is colored according to the pathway's adjusted P-value. The vertical gray lines indicate the log2FC values of all genes in the enriched pathways.