François Bucchini, Andrea Del Cortona, Łukasz Kreft, Alexander Botzki, Michiel Van Bel, Klaas Vandepoele.
Abstract
Advances in high-throughput sequencing have resulted in a massive increase of RNA-Seq transcriptome data. However, the promise of rapid gene expression profiling in a specific tissue, condition, unicellular organism or microbial community comes with new computational challenges. Owing to the limited availability of well-resolved reference genomes, de novo assembled (meta)transcriptomes have emerged as popular tools for investigating the gene repertoire of previously uncharacterized organisms. Yet, despite their potential, these datasets often contain fragmented or contaminant sequences, and their analysis remains difficult. To alleviate some of these challenges, we developed TRAPID 2.0, a web application for the fast and efficient processing of assembled transcriptome data. The initial processing phase performs a global characterization of the input data, providing each transcript with several layers of annotation, comprising structural, functional, and taxonomic information. The exploratory phase enables downstream analyses from the web application. Available analyses include the assessment of gene space completeness, the functional analysis and comparison of transcript subsets, and the study of transcripts in an evolutionary context. A comparison with similar tools highlights TRAPID's unique features. Finally, analyses performed within TRAPID 2.0 are complemented by interactive data visualizations, facilitating the extraction of new biological insights, as demonstrated with diatom community metatranscriptomes.
Year: 2021 PMID: 34197621 PMCID: PMC8464036 DOI: 10.1093/nar/gkab565
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. TRAPID 2.0 workflow and functionality overview. TRAPID’s workflow comprises two main parts: an initial processing phase (1), executed non-interactively after data upload, and an exploratory phase that consists of a set of functional and comparative tools (2) accessible through the web application. Input and output data are represented by cyan boxes. Output data boxes are complemented by thumbnails that depict the visualizations available from the web application. Available reference databases (consisting of functionally annotated proteomes and gene families) are represented by orange cylinders. Analysis and computation steps are represented by grey boxes and solid arrows.
Figure 2. Ostreococcus mediterraneus transcriptome taxonomic classification results visualization. (A) Krona sunburst chart. (B) Tree viewer: users can explore the classification of transcripts along the tree of life, enabling in-depth investigation of the results. (C) Sample composition bar charts: domain-level composition is shown on the left and the top ten most represented clades at a given taxonomic rank (here genus) on the right, providing an overview of the results. Transcript subsets can be defined using any of the available visualizations for further analysis. These results were generated using MMETSP0929 (Ostreococcus mediterraneus clade-D-RCC2572).
Figure 3. Genus-level taxonomic classification, core eukaryotic completeness, and transcript count of 26 microbial eukaryote transcriptomes. (A) Genus-level taxonomic classification summary. For each transcriptome, only up to 10 top represented genera (genera to which the most transcripts are assigned) are shown. Transcripts not assigned to any taxonomic label are represented as ‘Unclassified’ (dark grey fraction), and transcripts assigned to other genera or genera encompassing <1% of the depicted transcripts are aggregated as ‘Other’ (light grey fraction). Transcript sequences classified to a rank above genus-level are not represented. The circle on the right of each classification summary indicates the number of top represented genera additionally supported by the classification of either SSU or LSU rRNA sequences detected by TRAPID 2.0 (see Materials and Methods). (B) Core eukaryotic completeness score, computed using 1116 ‘core’ eggNOG orthologous groups conserved in at least 90% of the eukaryote organisms present in the database (red dots, top x-axis), and transcript count (blue bars, bottom x-axis). The genus label associated to each depicted transcriptome was retrieved from the MMETSP metadata. These results were generated using 26 MMETSP samples processed with eggNOG 4.5 as a reference database and default initial processing parameters.
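The completeness score in panel B can be illustrated with a minimal sketch. It assumes the score is the fraction of 'core' gene families (those present in at least 90% of the reference species) that have at least one transcript assigned to them; the exact scoring used by TRAPID 2.0 is not given here. All function names, identifiers and data below are hypothetical.

```python
def core_gene_families(gf_species, all_species, min_conservation=0.9):
    """Return the gene families found in at least `min_conservation`
    of the reference species (assumed definition of a 'core' family)."""
    n_species = len(all_species)
    return {
        gf
        for gf, species in gf_species.items()
        if len(species & all_species) / n_species >= min_conservation
    }


def completeness_score(transcript_gfs, core_gfs):
    """Fraction of core gene families represented by at least one
    transcript, plus the set of core families that are missing."""
    if not core_gfs:
        raise ValueError("no core gene families defined")
    found = core_gfs & set(transcript_gfs.values())
    return len(found) / len(core_gfs), core_gfs - found


# Toy reference database: 10 species and 4 gene families (invented data).
all_species = {f"sp{i}" for i in range(1, 11)}
gf_species = {
    "GF_A": set(all_species),                # present in 10/10 species
    "GF_B": all_species - {"sp1"},           # present in 9/10 species
    "GF_C": {"sp1", "sp2", "sp3", "sp4", "sp5"},  # 5/10, not core
    "GF_D": set(all_species),                # present in 10/10 species
}
core = core_gene_families(gf_species, all_species)

# Toy transcriptome: transcript identifier -> assigned gene family.
transcript_gfs = {"t1": "GF_A", "t2": "GF_A", "t3": "GF_C", "t4": "GF_D"}
score, missing = completeness_score(transcript_gfs, core)
print(f"completeness: {score:.2f}, missing core families: {sorted(missing)}")
```

In this toy case two of the three core families are represented, so a low score would point at an incomplete or fragmented gene space rather than at the number of transcripts.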
Figure 4. Impact of non-canonical genetic code use during ORF prediction for ciliate transcriptomes. (A) Histogram of predicted ORF sequence length divided by best sequence similarity search hit length (‘best hit recovery ratio’) for 257 454 sequences from 16 ciliate MMETSP samples having homology support, using the standard or the ciliate nuclear genetic code for ORF prediction (translation table 1 or 6 respectively). For each genetic code, the distribution of ratio values is depicted as a rug plot, and the mean value represented by a dashed line. (B) Multiple sequence alignment of two transcripts from MMETSP0018 (Uronema sp. Bbcil) assigned to the ‘0IF5I’ eggNOG orthologous group with reference sequences from Alveolata. ORF sequences were predicted using either the standard or the ciliate nuclear genetic code (corresponding to blue and orange sequence labels, respectively). Amino acid residues are shaded based on the chemical properties of their functional groups, with orange circles indicating stop codons reassigned to glutamine. Sequence label prefixes were trimmed to improve legibility.
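The effect shown in panel A can be reproduced on a toy sequence. The sketch below is a simplification and not TRAPID's ORF predictor: it assumes the reading frame is already known, takes the longest stop-free stretch of codons as the ORF, and divides its length by a hypothetical best hit length. The only difference between the two codon tables is that translation table 6 reads TAA and TAG as glutamine instead of stop.

```python
from itertools import product

BASES = "TCAG"
# NCBI translation table 1 (standard code), codons ordered TTT, TTC, TTA, ...
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), STANDARD_AAS)
}
# Translation table 6 (ciliate nuclear code): TAA and TAG encode glutamine.
CILIATE = {**STANDARD, "TAA": "Q", "TAG": "Q"}


def longest_stop_free_stretch(seq, table, frame=0):
    """Length, in codons, of the longest run without a stop codon in the
    given forward frame (a crude stand-in for ORF prediction)."""
    seq = seq.upper()
    best = run = 0
    for i in range(frame, len(seq) - 2, 3):
        if table.get(seq[i:i + 3], "X") == "*":
            run = 0
        else:
            run += 1
            best = max(best, run)
    return best


def best_hit_recovery_ratio(seq, best_hit_length, table, frame=0):
    """Predicted ORF length divided by the length of the best hit."""
    return longest_stop_free_stretch(seq, table, frame) / best_hit_length


# Invented transcript with an in-frame TAA; hypothetical best hit of 8 residues.
transcript = "ATGGCTGCTTAAGCTGCTGCTCAGTGA"
ratio_standard = best_hit_recovery_ratio(transcript, 8, STANDARD)
ratio_ciliate = best_hit_recovery_ratio(transcript, 8, CILIATE)
print(ratio_standard, ratio_ciliate)
```

With the standard code the in-frame TAA truncates the ORF and only half of the hit length is recovered; with the ciliate code the same codon is read through as glutamine and the ratio reaches 1.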
Figure 5. Functional enrichment analysis of a subset of Ostreococcus mediterraneus metal ion transport transcripts. (A) Metal ion transport transcripts InterPro enrichment results. InterPro identifiers are represented on the x-axis, enrichment fold on the left y-axis (red bars), and enrichment q-value on the right y-axis (dark grey dots). (B) Sankey diagram depicting the relationships between metal transport transcripts (left blocks), significantly enriched InterPro domains (middle blocks), and PLAZA gene families (right blocks). Line width is proportional to transcript annotation (left lines) and GF membership (right lines). The maximum enrichment q-value threshold is 1e-5 and only gene families corresponding to at least two enrichment records (transcript-function pairs) are displayed. These results were generated using MMETSP0936 (Ostreococcus mediterraneus clade-D-RCC2573) with pico-PLAZA 3.0 as a reference database and default initial processing parameters.
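The two quantities plotted in panel A, enrichment fold and q-value, can be sketched as follows. The statistical procedure is not stated in this caption, so the example assumes a common choice: a one-sided hypergeometric test followed by Benjamini-Hochberg correction. Function names and counts are illustrative only.

```python
from math import comb


def enrichment_fold(k, n, K, N):
    """Frequency of an annotation in the subset (k of n transcripts)
    divided by its frequency in the background (K of N transcripts)."""
    return (k / n) / (K / N)


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n transcripts are drawn without replacement from
    N transcripts, K of which carry the annotation."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def benjamini_hochberg(pvalues):
    """Convert p-values to q-values (adjusted p-values), keeping input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Invented counts: (k, n, K, N) for three annotations tested on one subset.
tests = {
    "domain_1": (8, 10, 40, 1000),
    "domain_2": (2, 10, 150, 1000),
    "domain_3": (1, 10, 300, 1000),
}
pvals = [hypergeom_pvalue(*counts) for counts in tests.values()]
qvals = benjamini_hochberg(pvals)
for (name, counts), q in zip(tests.items(), qvals):
    keep = q <= 1e-5  # threshold quoted in the caption
    print(name, round(enrichment_fold(*counts), 1), f"{q:.2e}", keep)
```

Only annotations whose q-value falls at or below the chosen threshold would be carried over to a diagram such as the one in panel B.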
Transcriptome annotation and analysis platform feature comparison. The abbreviations are as follows: BiGG metabolic reactions (BiGG), Conserved Domains Database (CDD), Command-line Interface (CLI), Differential Expression Analysis (DEA), Enzyme Commission number (EC), Gene Ontology terms (GO), Graphical User Interface (GUI), InterPro (IPR), KEGG Orthology groups (KO), NCBI BLAST non-redundant protein database (NCBI nr), and Protein Domains (PD).

| Features | Blast2GO (a) | KAAS | eggNOG-mapper (v1) | Trinotate/trinotate-web | EnTAP | Annocript | Dammit | TRAPID 2.0 |
|---|---|---|---|---|---|---|---|---|
| Sequence similarity search | BLAST | BLAST | DIAMOND or HMMER | BLAST, HMMER | DIAMOND | BLAST, rpsBLAST | HMMER, LAST | DIAMOND |
| ORF prediction | No | No | No | TransDecoder | GeneMarkS-T, TransDecoder | DNA2PEP | TransDecoder | Homology-supported/ |
| Reference databases | Public databases | Curated KEGG genes | eggNOG 4.5 | Pfam, Swiss-Prot | Up to 5 user-selected protein databases, eggNOG 4.5 | UniProt (or UniRef), CDD, TrEMBL, user protein databases | Pfam, OrthoDB, UniRef90, user protein databases | PLAZA 4.5 monocots/dicots, pico-PLAZA 3.0, PLAZA diatoms 1.0, eggNOG 4.5 |
| | No | No | Yes | Yes | Yes | No | No | Yes |
| Functional annotation | GO, PD (IPR), EC, KEGG | KO | GO, KO, BiGG, COG functional categories | GO, PD (Pfam), KEGG | GO, PD (IPR), KO, BiGG, COG functional categories | GO, PD (Pfam) | PD (Pfam) | GO, PD (IPR), KO |
| Non-coding RNA annotation | No | No | No | RNAMMER (rRNAs) | No | BLASTn with SILVA/Rfam, lncRNAs | Infernal with Rfam | Infernal with Rfam |
| Expression analysis | No | No | No | DEA, clustering based on expression (RSEM) (b) | Filtering of lowly expressed transcripts (RSEM) | DEA (edgeR) | No | No |
| | Yes | No | No | Yes (b) | No | No | No | Yes |
| Taxonomic classification | No | No | No | No | No | No | No | Kaiju with NCBI |
| Multiple sequence alignment | No | No | No (c) | No | No | No | No | MUSCLE, MAFFT |
| Phylogenetic tree inference | No | No | No (c) | No | No | No | No | FastTree2, IQ-TREE, PhyML, RAxML |
| Gene space completeness | No | No | No | No | No | No | BUSCO | Core GF completeness |
| Interface | GUI | GUI | GUI and CLI | GUI and CLI | CLI | CLI | CLI | GUI |
| Other features | Blast2GO Apps | Graphical pathway maps | Predicted gene names, free text functional descriptions | Transmembrane region and signal peptide prediction (TMHMM, SignalP) | Similarity search filtering: Taxonomic Favoring and Contaminant Filtering | HTML summary (plots and statistics) | Ortholog detection using CRBL | Putative frameshift detection, ORF length meta-annotation, interactive analysis of transcript subsets |

(a) Blast2GO basic version.
(b) Available as utility scripts from the Trinity package.
(c) MSAs and phylogenetic trees are available for each eggNOG group, externally of eggNOG-mapper.
Figure 6. Diatom-rich communities metatranscriptome taxonomic classification and diatom-assigned transcript subsets InterPro enrichment results. (A) Domain-level taxonomic profile of the global metatranscriptome. 4734 transcript sequences assigned to ‘cellular organisms’ or the root node of the taxonomy are not represented. (B) Genus-level taxonomic classification summary of transcripts expressed in each of the three sampling sites. For each sample, the fraction of expressed transcripts assigned to the top 10 represented genera (the genera to which the most transcripts were assigned overall) is shown. Transcripts assigned to other less represented genera are aggregated as ‘Other’ (light grey fraction), and transcripts not assigned to any genus are not displayed. The numerical values complementing the sample identifiers indicate the ratio of transcripts classified at the genus-level over the total amount of expressed transcripts. (C) Heat map showing the 25 most enriched InterPro domains per subset of diatom-assigned transcripts expressed in each sampling site, compared with the global metatranscriptome (maximum enrichment q-value 0.01). Transcript subsets are represented along the x-axis and enriched InterPro domains along the y-axis. Transcripts associated to a subset and an enriched InterPro domain are binned by assigned genus (three most represented diatom genera and ‘Other’). For each combination of subset, enriched InterPro domain, and assigned genus, the circle size is proportional to the ‘taxonomic ratio’, a ratio of the frequency of the subset's transcripts associated to the InterPro domain and assigned to the genus over the observed frequency for all the transcripts expressed in the sampling site. Enriched InterPro domains were assigned to broad functional categories, indicated by row annotations, and equivalent enrichment results were filtered to reduce redundancy. The significance of the enrichment is depicted as a color gradient, and each column is annotated with sample and genus classification information.
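The 'taxonomic ratio' of panel C can be read in more than one way. The sketch below follows one plausible interpretation, stated here as an assumption: among the subset's transcripts that carry a given InterPro domain, take the fraction assigned to a genus, and divide it by the fraction of all transcripts expressed at the site that are assigned to that genus. The function name, the domain label and all counts are invented for illustration.

```python
def taxonomic_ratio(subset, site, domain, genus):
    """Ratio of the genus frequency among subset transcripts carrying
    `domain` to the genus frequency among all site transcripts.

    `subset` and `site` are lists of (genus, set_of_domains) tuples,
    one tuple per transcript. Returns None when the ratio is undefined.
    """
    genera_with_domain = [g for g, domains in subset if domain in domains]
    if not genera_with_domain or not site:
        return None
    subset_freq = genera_with_domain.count(genus) / len(genera_with_domain)
    site_freq = sum(1 for g, _ in site if g == genus) / len(site)
    if site_freq == 0:
        return None
    return subset_freq / site_freq


# Invented site: 10 expressed transcripts, 4 of them assigned to Chaetoceros.
site = (
    [("Chaetoceros", {"domain_A"})] * 3
    + [("Chaetoceros", set())]
    + [("Thalassiosira", {"domain_A"})]
    + [("Thalassiosira", set())] * 5
)
# Invented subset: 5 transcripts, 4 of which carry domain_A.
subset = (
    [("Chaetoceros", {"domain_A"})] * 3
    + [("Thalassiosira", {"domain_A"})]
    + [("Thalassiosira", set())]
)
ratio = taxonomic_ratio(subset, site, "domain_A", "Chaetoceros")
print(round(ratio, 3))
```

Under this reading, a ratio above 1 means the genus contributes more transcripts to the enriched domain than its overall share of the site's expressed transcripts would suggest.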